Abstract

We show that accelerated gradient descent, averaged gradient descent and the heavy-ball
method for non-strongly-convex problems may be reformulated as constant parameter
second-order difference equation algorithms, where stability of the system is
equivalent to convergence at rate $O(1/n^2)$, where $n$ is the number of iterations. We
provide a detailed analysis of the eigenvalues of the corresponding linear dynamical
system, showing various oscillatory and non-oscillatory behaviors, together with a sharp
stability result with explicit constants. We also consider the situation where noisy
gradients are available, where we extend our general convergence result, which suggests an
alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of
both averaging and acceleration.


1 Introduction 

Many problems in machine learning are naturally cast as convex optimization problems 
over a Euclidean space; for supervised learning this includes least-squares regression, logistic 
regression, and the support vector machine. Faced with large amounts of data, practitioners 
often favor first-order techniques based on gradient descent, leading to algorithms with 
many cheap iterations. For smooth problems, two extensions of gradient descent have had 
important theoretical and practical impacts: acceleration and averaging. 

Acceleration techniques date back to Nesterov (1983) and have their roots in momentum
techniques and conjugate gradient (Polyak, 1987). For convex problems, with an appropriately
weighted momentum term which requires storing two iterates, Nesterov (1983) showed
that the traditional convergence rate of $O(1/n)$ for the function values after $n$ iterations of
gradient descent goes down to $O(1/n^2)$ for accelerated gradient descent, such a rate being
optimal among first-order techniques that can access only sequences of gradients (Nesterov,
2004). Like conjugate gradient methods for solving linear systems, these methods are however
more sensitive to noise in the gradients; that is, to preserve their improved convergence
rates, significantly less noise may be tolerated (d'Aspremont, 2008; Schmidt et al., 2011;
Devolder et al., 2014).

Averaging techniques, which consist in replacing the iterates by the average of all iterates,
have also been thoroughly considered, either because they sometimes lead to simpler proofs,
or because they lead to improved behavior. In the noiseless case where gradients are exactly
available, they do not improve the convergence rate in the convex case; worse, for
strongly-convex problems, they are not linearly convergent while regular gradient descent is. Their
main advantage comes with random unbiased gradients, where it has been shown that they
lead to better convergence rates than the unaveraged counterparts, in particular because
they allow larger step-sizes (Polyak and Juditsky, 1992; Bach and Moulines, 2011). For
example, for least-squares regression with stochastic gradients, they lead to convergence
rates of $O(1/n)$, even in the non-strongly convex case (Bach and Moulines, 2013).

In this paper, we show that for quadratic problems, both averaging and acceleration are two
instances of the same second-order finite difference equation, with different step-sizes. They
may thus be analyzed jointly, together with a non-strongly convex version of the heavy-ball
method (Polyak, 1987, Section 3.2). In the presence of random zero-mean noise on the
gradients, this joint analysis allows us to design a novel intermediate algorithm that exhibits
the good aspects of both acceleration (quick forgetting of initial conditions) and averaging
(robustness to noise).

In this paper, we make the following contributions: 

- We show in Section 2 that accelerated gradient descent, averaged gradient descent
and the heavy-ball method for non-strongly-convex problems may be reformulated as
constant parameter second-order difference equation algorithms, where stability of the
system is equivalent to convergence at rate $O(1/n^2)$.

- In Section 3, we provide a detailed analysis of the eigenvalues of the corresponding linear 
dynamical system, showing various oscillatory and non-oscillatory behaviors, together 
with a sharp stability result with explicit constants. 

- In Section 4, we consider the situation where noisy gradients are available, where we 
extend our general convergence result, which suggests an alternative algorithm (i.e., with 
different step sizes) that exhibits the good aspects of both averaging and acceleration. 

- In Section 5, we illustrate our results with simulations on synthetic examples. 


2 Second-Order Iterative Algorithms for Quadratic Functions

Throughout this paper, we consider minimizing a convex quadratic function $f : \mathbb{R}^d \to \mathbb{R}$
defined as:

    $f(\theta) = \frac{1}{2} \langle \theta, H\theta \rangle - \langle q, \theta \rangle$,    (1)

with $H \in \mathbb{R}^{d \times d}$ a symmetric positive semi-definite matrix and $q \in \mathbb{R}^d$. Without loss of
generality, $H$ is assumed invertible (by projecting onto the orthogonal of its null space),
though its eigenvalues could be arbitrarily small. The solution is known to be $\theta_* = H^{-1} q$,
but the inverse of the Hessian is often too expensive to compute when $d$ is large. The excess
cost function may be simply expressed as $f(\theta_n) - f(\theta_*) = \frac{1}{2} \langle \theta_n - \theta_*, H(\theta_n - \theta_*) \rangle$.


2.1 Second-order algorithms 

In this paper we study second-order iterative algorithms of the form:

    $\theta_{n+1} = A_n \theta_n + B_n \theta_{n-1} + c_n$,    (2)

started with $\theta_1 = \theta_0$ in $\mathbb{R}^d$, with $A_n \in \mathbb{R}^{d \times d}$, $B_n \in \mathbb{R}^{d \times d}$ and $c_n \in \mathbb{R}^d$ for all $n \in \mathbb{N}^*$. We
impose the natural restriction that the optimum $\theta_*$ is a stationary point of this recursion,
that is, for all $n \in \mathbb{N}^*$:

    $\theta_* = A_n \theta_* + B_n \theta_* + c_n$.    ($\theta_*$-stationarity)

By letting $\phi_n = \theta_n - \theta_*$ we then have $\phi_{n+1} = A_n \phi_n + B_n \phi_{n-1}$, started from $\phi_0 = \phi_1 = \theta_0 - \theta_*$.
Thus, we restrict our problem to the study of the convergence of an iterative system to 0.

In connection with accelerated methods, we are interested in algorithms for which $f(\theta_n) -
f(\theta_*) = \frac{1}{2} \langle \phi_n, H\phi_n \rangle$ converges to 0 at a speed of $O(1/n^2)$. Within this context we impose
that $A_n$ and $B_n$ have the form:

    $A_n = \frac{n}{n+1} A$ and $B_n = \frac{n-1}{n+1} B$ for all $n \in \mathbb{N}$, with $A, B \in \mathbb{R}^{d \times d}$.    ($n$-scalability)

By letting $\eta_n = n\phi_n = n(\theta_n - \theta_*)$, we can now study the simple iterative system with
constant terms $\eta_{n+1} = A\eta_n + B\eta_{n-1}$, started at $\eta_0 = 0$ and $\eta_1 = \theta_0 - \theta_*$. If we show that
the function values $\langle \eta_n, H\eta_n \rangle$ remain bounded, we directly obtain the convergence of $f(\theta_n)$ to $f(\theta_*)$ at
the speed $O(1/n^2)$. Thus the $n$-scalability property allows us to switch from a convergence
problem to a stability problem.

For feasibility concerns the method can only access H through matrix-vector products. 
Therefore A and B should be polynomials in H and c a polynomial in H times q, if possible 
of low degree. The following theorem clarifies the general form of iterative systems which 
share these three properties (see proof in Appendix B). 

Theorem 1. Let $(P_n, Q_n, R_n) \in (\mathbb{R}[X])^3$ for all $n \in \mathbb{N}$, be a sequence of polynomials. If the
iterative algorithm defined by Eq. (2) with $A_n = P_n(H)$, $B_n = Q_n(H)$ and $c_n = R_n(H)q$ satisfies
the $\theta_*$-stationarity and $n$-scalability properties, there are polynomials $(\bar{A}, \bar{B}) \in (\mathbb{R}[X])^2$
such that:

    $A_n = \frac{2n}{n+1} \left( I - \frac{\bar{A}(H) + \bar{B}(H)}{2} H \right)$,  $B_n = -\frac{n-1}{n+1} \left( I - \bar{B}(H) H \right)$  and  $c_n = \frac{n\bar{A}(H) + \bar{B}(H)}{n+1} q$.


Note that our result prevents $A_n$ and $B_n$ from being zero, thus requiring the algorithm to
strictly be of second order. This illustrates the fact that first-order algorithms such as gradient
descent do not have the convergence rate in $O(1/n^2)$.

We now restrict our class of algorithms to lowest possible order polynomials, that is, $\bar{A} = \alpha I$
and $\bar{B} = \beta I$ with $(\alpha, \beta) \in \mathbb{R}^2$, which correspond to the fewest matrix-vector products per
iteration, leading to the constant-coefficient recursion for $\eta_n = n\phi_n = n(\theta_n - \theta_*)$:

    $\eta_{n+1} = (I - \alpha H)\eta_n + (I - \beta H)(\eta_n - \eta_{n-1})$.    (3)


Expression with gradients of $f$. The recursion in Eq. (3) may be written with gradients
of $f$ in multiple ways. In order to preserve the parallel with accelerated techniques, we
rewrite it as:

    $\theta_{n+1} = \frac{2n}{n+1}\theta_n - \frac{n-1}{n+1}\theta_{n-1} - \frac{n\alpha+\beta}{n+1} f'\left( \frac{n(\alpha+\beta)}{n\alpha+\beta}\theta_n - \frac{(n-1)\beta}{n\alpha+\beta}\theta_{n-1} \right)$.    (4)

It may be interpreted as a modified gradient recursion with two potentially different affine
(i.e., with coefficients that sum to one) combinations of the two past iterates. This reformulation
will also be crucial when using noisy gradients. The allowed values for $(\alpha, \beta) \in \mathbb{R}^2$
will be determined in the following sections.
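As an illustration, the recursion of Eq. (4) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the Hessian is taken diagonal (`h` holds its eigenvalues), and the names `grad`, `excess_cost` and `run_eq4` are hypothetical helpers, not part of the paper.

```python
def grad(theta, h, q):
    """Gradient f'(theta) = H theta - q, assuming a diagonal H = diag(h)."""
    return [hi * ti - qi for hi, ti, qi in zip(h, theta, q)]


def excess_cost(theta, h, q):
    """f(theta) - f(theta_*) = 0.5 * <theta - theta_*, H (theta - theta_*)>."""
    return 0.5 * sum(hi * (ti - qi / hi) ** 2 for hi, ti, qi in zip(h, theta, q))


def run_eq4(h, q, theta0, alpha, beta, n_steps):
    """Iterate Eq. (4) from theta_1 = theta_0.

    Returns a list `it` with it[n] = theta_n for n = 0, ..., n_steps + 1.
    """
    if alpha < 0 or beta < 0 or alpha + beta == 0:
        raise ValueError("this sketch needs alpha, beta >= 0 and alpha + beta > 0")
    prev, cur = list(theta0), list(theta0)
    iterates = [prev, cur]
    for n in range(1, n_steps + 1):
        step = n * alpha + beta
        # Affine combination of the two past iterates where the gradient is taken.
        point = [(n * (alpha + beta) * c - (n - 1) * beta * p) / step
                 for c, p in zip(cur, prev)]
        g = grad(point, h, q)
        nxt = [(2 * n * c - (n - 1) * p - step * gi) / (n + 1)
               for c, p, gi in zip(cur, prev, g)]
        prev, cur = cur, nxt
        iterates.append(cur)
    return iterates
```

Starting the recursion at $\theta_*$ leaves the iterates at $\theta_*$, which is the $\theta_*$-stationarity property in action.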


2.2 Examples 

Averaged gradient descent. We consider averaged gradient descent (referred to from
now on as "Av-GD") (Polyak and Juditsky, 1992) with step-size $\gamma \in \mathbb{R}$ defined by:

    $\psi_{n+1} = \psi_n - \gamma f'(\psi_n)$,  $\theta_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} \psi_i$.

When computing the average online as $\theta_{n+1} = \theta_n + \frac{1}{n+1}(\psi_{n+1} - \theta_n)$ and seeing the average
as the main iterate, the algorithm becomes (see proof in Appendix B.2):

    $\theta_{n+1} = \frac{2n}{n+1}\theta_n - \frac{n-1}{n+1}\theta_{n-1} - \frac{\gamma}{n+1} f'\left( n\theta_n - (n-1)\theta_{n-1} \right)$.

This corresponds to Eq. (4) with $\alpha = 0$ and $\beta = \gamma$.
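The equivalence between the two forms of averaged gradient descent can be checked numerically. The sketch below assumes a scalar quadratic with illustrative values $h = 0.3$ and $q = 1.2$ (so $\theta_* = 4$); the function names are hypothetical.

```python
def f_prime(theta, h=0.3, q=1.2):
    """Gradient of the scalar quadratic f(theta) = 0.5*h*theta**2 - q*theta (illustrative values)."""
    return h * theta - q


def averaged_gd(theta0, gamma, n_steps):
    """Gradient descent on psi with an online average; returns [theta_1, ..., theta_{n_steps+1}]."""
    psi = theta0      # psi_1 = theta_1 = theta_0
    theta = theta0
    thetas = [theta]
    for n in range(1, n_steps + 1):
        psi = psi - gamma * f_prime(psi)           # psi_{n+1}
        theta = theta + (psi - theta) / (n + 1)    # online average theta_{n+1}
        thetas.append(theta)
    return thetas


def averaged_gd_second_order(theta0, gamma, n_steps):
    """Same iterates from the second-order form, i.e. Eq. (4) with alpha = 0 and beta = gamma."""
    prev, cur = theta0, theta0
    thetas = [cur]
    for n in range(1, n_steps + 1):
        point = n * cur - (n - 1) * prev           # equals psi_n
        nxt = (2 * n * cur - (n - 1) * prev - gamma * f_prime(point)) / (n + 1)
        prev, cur = cur, nxt
        thetas.append(cur)
    return thetas
```

Both functions produce the same sequence up to rounding, and the average approaches $\theta_*$ at the slow $O(1/n)$ rate in the iterates.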


Accelerated gradient descent. We consider the accelerated gradient descent (referred
to from now on as "Acc-GD") (Nesterov, 1983) with step-sizes $(\gamma, \delta_n) \in \mathbb{R}^2$:

    $\theta_{n+1} = \omega_n - \gamma f'(\omega_n)$,  $\omega_n = \theta_n + \delta_n(\theta_n - \theta_{n-1})$.

For smooth optimization the accelerated literature (Nesterov, 2004; Beck and Teboulle,
2009) uses the step-size $\delta_n = 1 - \frac{3}{n+1}$ and their results are not valid for bigger step-size
$\delta_n$. However $\delta_n = 1 - \frac{2}{n+1}$ is compatible with the framework of Lan (2012) and is more
convenient for our set-up. This corresponds to Eq. (4) with $\alpha = \gamma$ and $\beta = \gamma$. Note that
accelerated techniques are more generally applicable, e.g., to composite optimization with
smooth functions (Nesterov, 2013; Beck and Teboulle, 2009).

Heavy ball. We consider the heavy-ball algorithm (referred to from now on as "HB")
(Polyak, 1964) with step-sizes $(\gamma, \delta_n) \in \mathbb{R}^2$:

    $\theta_{n+1} = \theta_n - \gamma f'(\theta_n) + \delta_n(\theta_n - \theta_{n-1})$,

when $\delta_n = 1 - \frac{2}{n+1}$. We note that typically $\delta_n$ is constant for strongly-convex problems.
This corresponds to Eq. (4) with $\alpha = \gamma$ and $\beta = 0$, up to the factor $\frac{n}{n+1}$ that Eq. (4) places
in front of the gradient and which tends to one.
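The two correspondences can be verified on a scalar quadratic. The sketch below assumes illustrative values $h = 0.3$, $q = 1.2$ and momentum $\delta_n = 1 - 2/(n+1)$; for the heavy ball it uses the gradient step $n\gamma/(n+1)$, which is what Eq. (4) gives exactly for $\alpha = \gamma$, $\beta = 0$. All function names are hypothetical.

```python
def f_prime(theta, h=0.3, q=1.2):
    """Gradient of the scalar quadratic f(theta) = 0.5*h*theta**2 - q*theta (illustrative values)."""
    return h * theta - q


def eq4(theta0, alpha, beta, n_steps):
    """Scalar version of Eq. (4), started with theta_1 = theta_0."""
    prev, cur = theta0, theta0
    thetas = [cur]
    for n in range(1, n_steps + 1):
        step = n * alpha + beta
        point = (n * (alpha + beta) * cur - (n - 1) * beta * prev) / step
        nxt = (2 * n * cur - (n - 1) * prev - step * f_prime(point)) / (n + 1)
        prev, cur = cur, nxt
        thetas.append(cur)
    return thetas


def accelerated_gd(theta0, gamma, n_steps):
    """Acc-GD: gradient step taken at the extrapolated point omega_n."""
    prev, cur = theta0, theta0
    thetas = [cur]
    for n in range(1, n_steps + 1):
        delta = 1 - 2 / (n + 1)
        omega = cur + delta * (cur - prev)
        prev, cur = cur, omega - gamma * f_prime(omega)
        thetas.append(cur)
    return thetas


def heavy_ball(theta0, gamma, n_steps):
    """HB with gradient step n*gamma/(n+1), taken at the current iterate."""
    prev, cur = theta0, theta0
    thetas = [cur]
    for n in range(1, n_steps + 1):
        delta = 1 - 2 / (n + 1)
        nxt = cur - (n * gamma / (n + 1)) * f_prime(cur) + delta * (cur - prev)
        prev, cur = cur, nxt
        thetas.append(cur)
    return thetas
```

Comparing `accelerated_gd(theta0, gamma, n)` with `eq4(theta0, gamma, gamma, n)`, and `heavy_ball(theta0, gamma, n)` with `eq4(theta0, gamma, 0.0, n)`, gives matching sequences up to rounding.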


3 Convergence with Noiseless Gradients 


We study the convergence of the iterates defined by: $\eta_{n+1} = (I - \alpha H)\eta_n + (I - \beta H)(\eta_n - \eta_{n-1})$.
This is a second-order iterative system with constant coefficients that it is standard to cast
in a linear framework (see, e.g., Ortega and Rheinboldt, 2000). We may rewrite it as:

    $\Theta_n = F\Theta_{n-1}$, with $\Theta_n = \begin{pmatrix} \eta_n \\ \eta_{n-1} \end{pmatrix}$ and $F = \begin{pmatrix} 2I - (\alpha+\beta)H & \beta H - I \\ I & 0 \end{pmatrix} \in \mathbb{R}^{2d \times 2d}$.

Thus $\Theta_n = F^n \Theta_0$. Following O'Donoghue and Candes (2013), if we consider an eigenvalue
decomposition of $H$, i.e., $H = P\,\mathrm{Diag}(h)\,P^\top$ with $P$ an orthogonal matrix and $(h_i)$ the
eigenvalues of $H$, sorted in decreasing order: $h_d = L \geq h_{d-1} \geq \cdots \geq h_2 \geq h_1 = \mu > 0$, then
Eq. (3) may be rewritten as:

    $P^\top\eta_{n+1} = (I - \alpha\,\mathrm{Diag}(h))\,P^\top\eta_n + (I - \beta\,\mathrm{Diag}(h))\,(P^\top\eta_n - P^\top\eta_{n-1})$.    (5)

Thus there is no interaction between the different eigenspaces and we may consider, for the
analysis only, $d$ different recursions with $\eta^i_n = p_i^\top \eta_n$, $i \in \{1, \dots, d\}$, where $p_i \in \mathbb{R}^d$ is the $i$-th
column of $P$:

    $\eta^i_{n+1} = (1 - \alpha h_i)\,\eta^i_n + (1 - \beta h_i)\,(\eta^i_n - \eta^i_{n-1})$.    (6)


3.1 Characteristic polynomial and eigenvalues 


In this section, we consider a fixed $i \in \{1, \dots, d\}$ and study the stability in the corresponding
eigenspace. This linear dynamical system may be analyzed by studying the eigenvalues of
the $2 \times 2$ matrix

    $F_i = \begin{pmatrix} 2 - (\alpha+\beta)h_i & \beta h_i - 1 \\ 1 & 0 \end{pmatrix}$.

These eigenvalues are the roots of its characteristic polynomial, which is:

    $\det(XI - F_i) = X(X - 2 + (\alpha+\beta)h_i) + 1 - \beta h_i = X^2 - 2X\left(1 - \frac{\alpha+\beta}{2}h_i\right) + 1 - \beta h_i$.


To compute the roots of the second-order polynomial, we compute its reduced discriminant:

    $\Delta_i = \left(1 - \frac{\alpha+\beta}{2}h_i\right)^2 - 1 + \beta h_i = h_i\left( \left(\frac{\alpha+\beta}{2}\right)^2 h_i - \alpha \right)$.

Depending on the sign of the discriminant $\Delta_i$, there will be two real distinct eigenvalues
($\Delta_i > 0$), two complex conjugate eigenvalues ($\Delta_i < 0$) or a single real eigenvalue ($\Delta_i = 0$).

Figure 1: Area of stability of the algorithm, with the three traditional algorithms represented.
In the interior of the triangle, the convergence is linear.

We will now study the sign of $\Delta_i$. In each different case, we will determine under what
conditions on $\alpha$ and $\beta$ the modulus of the eigenvalues is less than one, which means that the
iterates $(\eta^i_n)_n$ remain bounded and the iterates $(\theta_n)_n$ converge to $\theta_*$. We may then compute
function values as $f(\theta_n) - f(\theta_*) = \frac{1}{2}\sum_{i=1}^d (\phi^i_n)^2 h_i = \frac{1}{2n^2}\sum_{i=1}^d (\eta^i_n)^2 h_i$.

The various regimes are summarized in Figure 1: there is a triangle of values of $(\alpha h_i, \beta h_i)$
for which the algorithm remains stable (i.e., the iterates $(\eta_n)_n$ do not diverge), with either
complex or real eigenvalues. In the following lemmas (see proof in Appendix C), we provide
a detailed analysis that leads to Figure 1.
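The characteristic polynomial, its reduced discriminant and the resulting eigenvalues can be sketched as follows for a single eigenvalue $h_i$ (written `h`). The helper names are hypothetical and the snippet is only meant to make the three regimes concrete.

```python
import cmath


def discriminant(alpha, beta, h):
    """Reduced discriminant Delta_i = h * (((alpha + beta) / 2)**2 * h - alpha)."""
    return h * (((alpha + beta) / 2) ** 2 * h - alpha)


def eigenvalues(alpha, beta, h):
    """Roots r_i + sqrt(Delta_i) and r_i - sqrt(Delta_i) of the characteristic polynomial."""
    r = 1 - (alpha + beta) / 2 * h
    root = cmath.sqrt(discriminant(alpha, beta, h))
    return r + root, r - root


def char_poly(x, alpha, beta, h):
    """X^2 - 2 X (1 - (alpha + beta) h / 2) + 1 - beta h."""
    return x * x - 2 * x * (1 - (alpha + beta) / 2 * h) + 1 - beta * h


def regime(alpha, beta, h):
    """Classify the eigenvalues from the sign of the discriminant."""
    delta = discriminant(alpha, beta, h)
    if delta > 0:
        return "real"
    if delta < 0:
        return "complex"
    return "double"
```

For instance, with $h = 0.25$ and $\gamma = 1$, the Av-GD choice $(\alpha, \beta) = (0, \gamma)$ gives the real eigenvalues $1$ and $0.75$, while the Acc-GD choice $(\gamma, \gamma)$ gives complex eigenvalues of modulus $\sqrt{1 - \gamma h}$.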

Lemma 1 (Real eigenvalues). The discriminant $\Delta_i$ is strictly positive and the algorithm is
stable if and only if

    $\alpha \geq 0$,  $\alpha + 2\beta \leq 4/h_i$,  $\alpha + \beta > 2\sqrt{\alpha/h_i}$.

We then have two real roots $r_i^{\pm} = r_i \pm \sqrt{\Delta_i}$ with $r_i = 1 - \frac{\alpha+\beta}{2}h_i$. Moreover, we have:

    $(\phi^i_n)^2 h_i = (\phi^i_1)^2 h_i \, \frac{\left[ (r_i + \sqrt{\Delta_i})^n - (r_i - \sqrt{\Delta_i})^n \right]^2}{4 n^2 \Delta_i}$.    (7)

Therefore, for real eigenvalues, $((\phi^i_n)^2 h_i)_n$ will converge to 0 at a speed of $O(1/n^2)$; however,
the constant $\Delta_i$ may be arbitrarily small (and thus the scaling factor arbitrarily large).
Furthermore we have linear convergence if the inequalities in the lemma are strict.


Lemma 2 (Complex eigenvalues). The discriminant $\Delta_i$ is strictly negative and the algorithm
is stable if and only if

    $\alpha \geq 0$,  $\beta \geq 0$,  $\alpha + \beta < 2\sqrt{\alpha/h_i}$.

We then have two complex conjugate eigenvalues: $r_i^{\pm} = r_i \pm \sqrt{-1}\sqrt{-\Delta_i}$. Moreover, we
have:

    $(\phi^i_n)^2 h_i = \frac{(\phi^i_1)^2 \, \rho_i^{2n} \, \sin^2(\omega_i n)}{n^2 \left( \alpha - \left(\frac{\alpha+\beta}{2}\right)^2 h_i \right)}$,    (8)

with $\rho_i = \sqrt{1 - \beta h_i}$, and $\omega_i$ defined through $\sin(\omega_i) = \sqrt{-\Delta_i}/\rho_i$ and $\cos(\omega_i) = r_i/\rho_i$.

Therefore, for complex eigenvalues, there is a linear convergence if the inequalities in the
lemma are strict. Moreover, $((\phi^i_n)^2 h_i)_n$ oscillates to 0 at a speed of $O(1/n^2)$ even if $h_i$ is
arbitrarily small.
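The closed forms of the two lemmas can be compared with a direct run of recursion (6). The sketch below works in one eigenspace with $\eta_0 = 0$ and $\eta_1 = \phi_1$; the function names are hypothetical, and the closed forms are the ones stated in Eqs. (7) and (8) as written in the docstrings.

```python
import math


def phi_sq_h_recursion(alpha, beta, h, phi1, n):
    """(phi_n)^2 * h obtained by iterating recursion (6) with eta_0 = 0, eta_1 = phi_1."""
    eta_prev, eta = 0.0, phi1
    for _ in range(n - 1):
        eta_prev, eta = eta, (1 - alpha * h) * eta + (1 - beta * h) * (eta - eta_prev)
    return (eta / n) ** 2 * h


def phi_sq_h_closed_form(alpha, beta, h, phi1, n):
    """Closed form: real case if Delta > 0, complex case if Delta < 0."""
    r = 1 - (alpha + beta) / 2 * h
    delta = h * (((alpha + beta) / 2) ** 2 * h - alpha)
    if delta > 0:
        root = math.sqrt(delta)
        return phi1 ** 2 * h * ((r + root) ** n - (r - root) ** n) ** 2 / (4 * n ** 2 * delta)
    if delta < 0:
        rho = math.sqrt(1 - beta * h)
        # omega satisfies sin(omega) = sqrt(-delta) / rho and cos(omega) = r / rho.
        omega = math.atan2(math.sqrt(-delta), r)
        return (phi1 ** 2 * rho ** (2 * n) * math.sin(omega * n) ** 2
                / (n ** 2 * (alpha - ((alpha + beta) / 2) ** 2 * h)))
    raise ValueError("double eigenvalue: neither closed form applies")
```

In the complex case the right-hand side has no factor $h$ left, which is why the $O(1/n^2)$ rate survives arbitrarily small eigenvalues.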


Coalescing eigenvalues. When the discriminant goes to zero in the explicit formulas of
the real and complex cases, both the denominator and numerator of $((\phi^i_n)^2 h_i)_n$ will go to
zero. In the limit case, when the discriminant is equal to zero, we will have a double real
eigenvalue. This happens for $\beta = 2\sqrt{\alpha/h_i} - \alpha$. Then the eigenvalue is $r_i = 1 - \sqrt{\alpha h_i}$, and the
algorithm is stable for $0 < \alpha < 4/h_i$; we then have $(\phi^i_n)^2 h_i = h_i (\phi^i_1)^2 (1 - \sqrt{\alpha h_i})^{2(n-1)}$. This
can be obtained by letting $\Delta_i$ go to 0 in the real and complex cases (see also Appendix C.3).


Summary. To conclude, the iterate $(\eta^i_n)_n = (n(\theta^i_n - \theta^i_*))_n$ will be stable for $\alpha \in [0, 4/h_i]$
and $\beta \in [0, 2/h_i - \alpha/2]$. According to the values of $\alpha$ and $\beta$ this iterate will have a different
behavior. In the complex case, the roots are complex conjugate with magnitude $\sqrt{1 - \beta h_i}$.
Thus, when $\beta > 0$, $(\eta^i_n)_n$ will converge to 0, oscillating, at rate $\sqrt{1 - \beta h_i}$. In the real
case, the two roots are real and distinct. However the product of the two roots is equal
to $1 - \beta h_i$, thus one of them has magnitude at least $\sqrt{|1 - \beta h_i|}$ and the iterate converges to 0 at a rate
higher (i.e., slower) than in the complex case (as long as $\alpha$ and $\beta$ belong to the interior of the stability
region).

Finally, for a given quadratic function $f$, all the $d$ iterates $(\eta^i_n)_n$ should be bounded, therefore
we must have $\alpha \in [0, 4/L]$ and $\beta \in [0, 2/L - \alpha/2]$. Then, depending on the value of $h_i$, some
eigenvalues may be complex or real.
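The stability triangle can be cross-checked against the spectral radius of the $2 \times 2$ recursion. The sketch below tests the condition "both eigenvalues have modulus at most one" on a grid; the function names are hypothetical, and boundary points where a double eigenvalue sits on the unit circle are treated as inside the triangle, as in the summary.

```python
import cmath


def spectral_radius(alpha, beta, h):
    """Largest modulus among the two roots of X^2 - 2 r X + 1 - beta h, r = 1 - (alpha+beta) h / 2."""
    r = 1 - (alpha + beta) / 2 * h
    root = cmath.sqrt(r * r - 1 + beta * h)
    return max(abs(r + root), abs(r - root))


def in_stability_triangle(alpha, beta, h):
    """alpha in [0, 4/h] and beta in [0, 2/h - alpha/2]."""
    return 0 <= alpha <= 4 / h and 0 <= beta <= 2 / h - alpha / 2


def grid_agreement(h=1.0, step=0.25):
    """True if the two criteria agree on a grid of (alpha, beta) values around the triangle."""
    for i in range(-4, 21):
        for j in range(-4, 13):
            alpha, beta = i * step / h, j * step / h
            stable = spectral_radius(alpha, beta, h) <= 1 + 1e-9
            if stable != in_stability_triangle(alpha, beta, h):
                return False
    return True
```

On this grid the two criteria coincide, which matches the triangle drawn in the $(\alpha h_i, \beta h_i)$ plane.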

3.2 Classical examples 

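For particular choices of $\alpha$ and $\beta$, displayed in Figure 1, the eigenvalues are either all real
or all complex, as shown in the table below.

                        Av-GD                     Acc-GD                                       Heavy ball
    $\alpha$            $0$                       $\gamma$                                     $\gamma$
    $\beta$             $\gamma$                  $\gamma$                                     $0$
    $\Delta_i$          $(\gamma h_i/2)^2$        $-\gamma h_i(1 - \gamma h_i)$                $-\gamma h_i(1 - \gamma h_i/4)$
    $r_i^{\pm}$         $1$, $1 - \gamma h_i$     $\sqrt{1 - \gamma h_i}\,e^{\pm i\omega_i}$   $e^{\pm i\omega_i}$
    $\cos(\omega_i)$    (real case)               $\sqrt{1 - \gamma h_i}$                      $1 - \gamma h_i/2$
    $\rho_i$            (real case)               $\sqrt{1 - \gamma h_i}$                      $1$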


Averaged gradient descent loses linear convergence for strongly-convex problems, because
$r_i^{+} = 1$ for all eigensubspaces. Similarly, the heavy ball method is not adaptive to strong
convexity because $\rho_i = 1$. However, accelerated gradient descent, although designed for
non-strongly-convex problems, is adaptive because $\rho_i = \sqrt{1 - \gamma h_i}$ depends on $h_i$ while $\alpha$
and $\beta$ do not. These last two algorithms have an oscillatory behavior which can be observed
in practice and has been already studied (Su et al., 2014).

Note that all the classical methods choose step-sizes $\alpha$ and $\beta$ for which the eigenvalues are
either all real or all complex, whereas we will see in Section 4 that it is significant to combine both
behaviors in the presence of noise.


3.3 General bound 


Even if the exact formulas in Lemmas 1 and 2 are computable, they are not easily interpretable.
In particular when the two roots become close, the denominator will go to zero,
which prevents us from bounding them easily. When we further restrict the domain of $(\alpha, \beta)$,
we can always bound the iterate by the general bound (see proof in Appendix D):

Theorem 2. For $\alpha \leq 1/h_i$ and $0 \leq \beta \leq 2/h_i - \alpha$, we have

    $(\eta^i_n)^2 \leq \min\left\{ \frac{2(\eta^i_1)^2}{\alpha h_i}, \; \frac{8(\eta^i_1)^2 n}{(\alpha+\beta)h_i}, \; \frac{16(\eta^i_1)^2}{((\alpha+\beta)h_i)^2} \right\}$.    (9)


These bounds are shown by dividing the set of $(\alpha, \beta)$ in three regions where we obtain
specific bounds. They do not depend on the regime of the eigenvalues (complex or real);
this enables us to get the following general bound on the function values, our main result
for the deterministic case.

Corollary 1. For $\alpha \leq 1/L$ and $0 \leq \beta \leq 2/L - \alpha$:

    $f(\theta_n) - f(\theta_*) \leq \min\left\{ \frac{\|\theta_0 - \theta_*\|^2}{\alpha n^2}, \; \frac{4\|\theta_0 - \theta_*\|^2}{(\alpha+\beta)n} \right\}$.    (10)

We can make the following observations:

- The first bound $\frac{\|\theta_0 - \theta_*\|^2}{\alpha n^2}$ corresponds to the traditional acceleration result, and is only
relevant for $\alpha > 0$ (that is, for Nesterov acceleration and the heavy-ball method, but
not for averaging). We recover the traditional convergence rate of second-order methods
for quadratic functions in the singular case, such as conjugate gradient (Polyak, 1987,
Section 6.1).

- The result above focuses on function values, like most results in the non-strongly
convex case; the distance to optimum $\|\theta_n - \theta_*\|^2$ typically does not go to zero (although
it remains bounded in our situation).

- When $\alpha = 0$ (averaged gradient descent), then the second bound $\frac{4\|\theta_0 - \theta_*\|^2}{\beta n}$ provides a
convergence rate of $O(1/n)$ if no assumption is made regarding the starting point $\theta_0$,
while the last bound of Theorem 2 would lead to a bound $\frac{8\|H^{-1/2}(\theta_0 - \theta_*)\|^2}{(\beta n)^2}$, that is a
rate of $O(1/n^2)$, only for some starting points.

- As shown in Appendix E by exhibiting explicit sequences of quadratic functions, the
inverse dependence in $\alpha n^2$ and $(\alpha + \beta)n$ in Eq. (10) is not improvable.
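The bound of Corollary 1 can be checked numerically on a small example. The sketch below works directly in the eigenbasis of $H$ (so `h` holds the eigenvalues and `phi1` the coordinates of $\theta_0 - \theta_*$), uses the bound $\min\{\|\theta_0 - \theta_*\|^2/(\alpha n^2),\ 4\|\theta_0 - \theta_*\|^2/((\alpha+\beta)n)\}$, and relies on hypothetical helper names.

```python
def excess_costs(h, phi1, alpha, beta, n_max):
    """[f(theta_n) - f(theta_*) for n = 1..n_max], from the per-eigenspace recursion on eta_n."""
    eta_prev = [0.0] * len(h)   # eta_0 = 0
    eta = list(phi1)            # eta_1 = theta_0 - theta_*
    costs = []
    for n in range(1, n_max + 1):
        costs.append(sum(hi * e * e for hi, e in zip(h, eta)) / (2 * n * n))
        eta, eta_prev = [(1 - alpha * hi) * e + (1 - beta * hi) * (e - ep)
                         for hi, e, ep in zip(h, eta, eta_prev)], eta
    return costs


def corollary1_bound(dist_sq, alpha, beta, n):
    """min of the two bounds; the first one is only defined for alpha > 0."""
    bounds = [4 * dist_sq / ((alpha + beta) * n)]
    if alpha > 0:
        bounds.append(dist_sq / (alpha * n * n))
    return min(bounds)


def check_corollary1(h, phi1, alpha, beta, n_max):
    """True if the excess cost stays below the bound for every n up to n_max."""
    dist_sq = sum(p * p for p in phi1)
    costs = excess_costs(h, phi1, alpha, beta, n_max)
    return all(c <= corollary1_bound(dist_sq, alpha, beta, n) * (1 + 1e-12)
               for n, c in enumerate(costs, start=1))
```

With eigenvalues $1/k^2$ and $L = 1$, the three classical choices $(0, 1)$, $(1, 1)$ and $(1, 0)$ for $(\alpha, \beta)$ all stay below the bound.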











Figure 2: Trade-off between averaged and accelerated methods for noisy gradients. 


4 Quadratic Optimization with Additive Noise 

In many practical situations, the gradient of $f$ is not available for the recursion in Eq. (4),
but only a noisy version. In this paper, we only consider additive uncorrelated noise with
finite variance.

4.1 Stochastic difference equation 

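We now assume that the true gradient is not available and we rather have access to a noisy
oracle for the gradient of $f$. In Eq. (4), we assume that the oracle outputs a noisy gradient
$f'\left( \frac{n(\alpha+\beta)}{n\alpha+\beta}\theta_n - \frac{(n-1)\beta}{n\alpha+\beta}\theta_{n-1} \right) - \varepsilon_{n+1}$. The noise $(\varepsilon_n)$ is assumed to be uncorrelated zero-mean
with bounded covariance, i.e., $\mathbb{E}[\varepsilon_n \otimes \varepsilon_m] = 0$ for all $n \neq m$ and $\mathbb{E}[\varepsilon_n \otimes \varepsilon_n] \preccurlyeq C$, where
$A \preccurlyeq B$ means that $B - A$ is positive semi-definite.

For quadratic functions, for the reduced variable $\eta_n = n\phi_n = n(\theta_n - \theta_*)$, we get:

    $\eta_{n+1} = (I - \alpha H)\eta_n + (I - \beta H)(\eta_n - \eta_{n-1}) + [n\alpha + \beta]\,\varepsilon_{n+1}$.    (11)

Note that algorithms with $\alpha \neq 0$ will have an important level of noise because of the term
$n\alpha\,\varepsilon_{n+1}$. We denote by $\xi_{n+1} = \begin{pmatrix} [n\alpha+\beta]\,\varepsilon_{n+1} \\ 0 \end{pmatrix}$ and we now have the recursion:

    $\Theta_{n+1} = F\Theta_n + \xi_{n+1}$,    (12)

which is a standard noisy linear dynamical system (see, e.g., Arnold, 1998) with uncorrelated
noise process $(\xi_n)$. We may thus express $\Theta_n$ directly as $\Theta_n = F^n\Theta_0 + \sum_{k=1}^n F^{n-k}\xi_k$, and its
expected second-order moment as $\mathbb{E}(\Theta_n\Theta_n^\top) = F^n\Theta_0\Theta_0^\top(F^n)^\top + \sum_{k=1}^n F^{n-k}\,\mathbb{E}(\xi_k\xi_k^\top)\,(F^{n-k})^\top$.

In order to obtain the expected excess cost function, we simply need to compute
$\operatorname{tr}\left[ \begin{pmatrix} H & 0 \\ 0 & 0 \end{pmatrix} \mathbb{E}(\Theta_n\Theta_n^\top) \right]$,
which thus decomposes as a term that only depends on initial conditions (which is exactly
the one computed and studied in Section 3.3), and a new term that depends on the noise.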
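The second-moment computation can be sketched in one eigenspace, where $F$ reduces to the $2 \times 2$ matrix with first row $(2 - (\alpha+\beta)h,\ \beta h - 1)$ and second row $(1, 0)$. The code below is a sketch under assumptions: the noise has variance `c` in that direction, the recursion is started at $\Theta_1 = (\eta_1, 0)$, and the function names are hypothetical.

```python
def second_moment(alpha, beta, h, c, eta1, n):
    """E[Theta_n Theta_n^T] as a 2x2 list for one eigen-direction of the noisy recursion.

    Uses M_{k+1} = F M_k F^T + E[xi_{k+1} xi_{k+1}^T], with xi_{k+1} = ((k*alpha + beta) * eps, 0)
    and E[eps^2] = c (assumed constant in this sketch).
    """
    f11, f12 = 2 - (alpha + beta) * h, beta * h - 1
    m = [[eta1 * eta1, 0.0], [0.0, 0.0]]
    for k in range(1, n):
        a = f11 * m[0][0] + f12 * m[1][0]   # (F M)[0][0]
        b = f11 * m[0][1] + f12 * m[1][1]   # (F M)[0][1]
        new00 = a * f11 + b * f12 + (k * alpha + beta) ** 2 * c
        new01 = a                            # (F M F^T)[0][1], since the second row of F is (1, 0)
        new11 = m[0][0]
        m = [[new00, new01], [new01, new11]]
    return m


def expected_excess_cost(alpha, beta, h, c, eta1, n):
    """Contribution h * E[eta_n^2] / (2 n^2) of this eigen-direction to E f(theta_n) - f(theta_*)."""
    return h * second_moment(alpha, beta, h, c, eta1, n)[0][0] / (2 * n * n)
```

Setting `c = 0` recovers the deterministic term, and setting `eta1 = 0` isolates the term that depends on the noise.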
4.2 Convergence result

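For a quadratic function $f$ with arbitrarily small eigenvalues and uncorrelated noise with
finite covariance, we obtain the following convergence result (see proof in Appendix F);
since we will allow the parameters $\alpha$ and $\beta$ to depend on the time we stop the algorithm,
we introduce the horizon $N$:

Theorem 3 (Convergence rates with noisy gradients). Assume $\mathbb{E}[\varepsilon_n \otimes \varepsilon_n] = C$ for all $n \in \mathbb{N}$,
$\alpha \leq \frac{1}{L}$ and $0 \leq \beta \leq \frac{2}{L} - \alpha$. Then for any $N \in \mathbb{N}$, we have:

    $\mathbb{E} f(\theta_N) - f(\theta_*) \leq \min\left\{ \frac{\|\theta_0 - \theta_*\|^2}{\alpha N^2} + \frac{(\alpha N + \beta)^2}{\alpha N}\operatorname{tr}(C), \; \frac{4\|\theta_0 - \theta_*\|^2}{(\alpha+\beta)N} + \frac{4(\alpha N + \beta)^2}{\alpha+\beta}\operatorname{tr}(C) \right\}$.    (13)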

We can make the following observations:

- Although we only provide an upper-bound, the proof technique relies on direct moment
computations in each eigensubspace with few inequalities, and we conjecture that the
scalings with respect to $n$ are tight.

- For $\alpha = 0$ and $\beta = 1/L$ (which corresponds to averaged gradient descent), the second
bound leads to $\frac{4L\|\theta_0 - \theta_*\|^2}{N} + \frac{4\operatorname{tr}(C)}{L}$, which is bounded but not converging to zero. We
recover a result from Bach and Moulines (2011, Theorem 1).

- For $\alpha = \beta = 1/L$ (which corresponds to Nesterov's acceleration), the first bound leads
to $\frac{L\|\theta_0 - \theta_*\|^2}{N^2} + \frac{(N+1)^2\operatorname{tr}(C)}{NL}$, and our bound suggests that the algorithm diverges, which
we have observed in our experiments in Appendix A.

- For $\alpha = 0$ and $\beta = 1/(L\sqrt{N})$, the second bound leads to $\frac{4L\|\theta_0 - \theta_*\|^2}{\sqrt{N}} + \frac{4\operatorname{tr}(C)}{L\sqrt{N}}$, and we
recover the traditional rate of $1/\sqrt{N}$ for stochastic gradient in the non-strongly-convex
case.

- When the values of the bias and the variance are known we can choose $\alpha$ and $\beta$ such
that the trade-off between the bias and the variance is optimal in our bound, as the
following corollary shows. Note that in the bound below, taking a non zero $\beta$ enables
the bias term to be adaptive to hidden strong-convexity.

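Corollary 2. For $\alpha = \min\left\{ \frac{\|\theta_0 - \theta_*\|}{2\sqrt{\operatorname{tr}(C)}\,N^{3/2}}, \; \frac{1}{L} \right\}$ and $\beta \in [0, \min\{N\alpha, 1/L\}]$, we have:

    $\mathbb{E} f(\theta_N) - f(\theta_*) \leq \frac{2L\|\theta_0 - \theta_*\|^2}{N^2} + \frac{4\sqrt{\operatorname{tr}(C)}\,\|\theta_0 - \theta_*\|}{\sqrt{N}}$.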
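The step-size choice of Corollary 2 can be turned into a small helper. The sketch below assumes the bound of Theorem 3 in the form $\min\{\|\theta_0-\theta_*\|^2/(\alpha N^2) + (\alpha N+\beta)^2\operatorname{tr}(C)/(\alpha N),\ 4\|\theta_0-\theta_*\|^2/((\alpha+\beta)N) + 4(\alpha N+\beta)^2\operatorname{tr}(C)/(\alpha+\beta)\}$ and the choice $\alpha = \min\{\|\theta_0-\theta_*\|/(2\sqrt{\operatorname{tr}(C)}N^{3/2}),\ 1/L\}$; the function and argument names are hypothetical.

```python
import math


def theorem3_bound(dist_sq, trace_c, alpha, beta, n):
    """Upper bound on E f(theta_N) - f(theta_*); the first term needs alpha > 0."""
    second = (4 * dist_sq / ((alpha + beta) * n)
              + 4 * (alpha * n + beta) ** 2 * trace_c / (alpha + beta))
    if alpha == 0:
        return second
    first = dist_sq / (alpha * n * n) + (alpha * n + beta) ** 2 * trace_c / (alpha * n)
    return min(first, second)


def corollary2_alpha(dist, trace_c, lip, n):
    """Step-size alpha balancing bias and variance for a known horizon n."""
    return min(dist / (2 * math.sqrt(trace_c) * n ** 1.5), 1 / lip)


def corollary2_bound(dist, trace_c, lip, n):
    """2 L dist^2 / n^2 + 4 sqrt(tr C) dist / sqrt(n)."""
    return 2 * lip * dist ** 2 / n ** 2 + 4 * math.sqrt(trace_c) * dist / math.sqrt(n)
```

For example, with $\|\theta_0 - \theta_*\| = 1$, $\operatorname{tr}(C) = 1$, $L = 1$ and $N = 100$, this gives $\alpha = 5 \times 10^{-4}$, and any $\beta \in [0, N\alpha]$ keeps the bound of Theorem 3 below the simplified bound of the corollary.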


4.3 Structured noise and least-square regression 

When only the noise total variance $\operatorname{tr}(C)$ is considered, as shown in Section 4.4, Corollary
2 recovers existing (more general) results. Our framework however leads to improved results
for structured noise processes frequent in machine learning, in particular in least-squares
regression, which we now consider, although this goes beyond it (see, e.g., Bach and Moulines, 2013).

Assume we observe independent and identically distributed pairs $(x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$ and
we want to minimize the expected loss $f(\theta) = \frac{1}{2}\mathbb{E}[(y_n - \langle\theta, x_n\rangle)^2]$. We denote by $H =
\mathbb{E}(x_n \otimes x_n)$ the covariance matrix, which is assumed invertible. The global minimum of $f$
is attained at $\theta_* \in \mathbb{R}^d$ defined as before and we denote by $r_n = y_n - \langle\theta_*, x_n\rangle$ the statistical
noise, which we assume bounded by $\sigma$. We have $\mathbb{E}[r_n x_n] = 0$. In an online setting, we
observe the gradient $(x_n \otimes x_n)(\theta - \theta_*) - r_n x_n$, whose expectation is the gradient $f'(\theta)$. This
corresponds to a noise in the gradient of $\varepsilon_n = (H - x_n \otimes x_n)(\theta - \theta_*) + r_n x_n$. Given $\theta$, if
the data $(x_n, y_n)$ are almost surely bounded, the covariance matrix of this noise is bounded
by a constant times $H$. This suggests characterizing the noise convergence by $\operatorname{tr}(CH^{-1})$,
which is bounded even though $H$ has arbitrarily small eigenvalues.

However, our result will not apply to stochastic gradient descent (SGD) for least-squares,
because of the term $(H - x_n \otimes x_n)(\theta - \theta_*)$ which depends on $\theta$, but to a "semi-stochastic"
recursion where the noisy gradient is $H(\theta - \theta_*) - r_n x_n$, with a noise process $\varepsilon_n = r_n x_n$,
which is such that $\mathbb{E}[\varepsilon_n \otimes \varepsilon_n] \preccurlyeq \sigma^2 H$, and has been used by Bach and Moulines (2011) and
Dieuleveut and Bach (2014) to prove results on regular stochastic gradient descent. We
conjecture that our algorithm (and results) also applies in the regular SGD case, and we
provide encouraging experiments in Section 5.

For this particular structured noise we can take advantage of a large $\beta$:

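Theorem 4 (Convergence rates with structured noisy gradients). Let $\alpha \leq \frac{1}{L}$ and $0 \leq \beta \leq \frac{3}{2L} - \frac{\alpha}{2}$.
For any $N \in \mathbb{N}$, $\mathbb{E} f(\theta_N) - f(\theta_*)$ is upper-bounded by:

    $\min\left\{ \frac{\|\theta_0 - \theta_*\|^2}{N^2\alpha} + \frac{(\alpha N + \beta)^2}{\alpha\beta N^2}\operatorname{tr}(CH^{-1}), \; \frac{4\|\theta_0 - \theta_*\|^2}{(\alpha+\beta)N} + \frac{8(\alpha N + \beta)^2}{(\alpha+\beta)^2 N}\operatorname{tr}(CH^{-1}) \right\}$.    (14)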


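We can make the following observations:

- For $\alpha = 0$ and $\beta = 1/L$ (which corresponds to averaged gradient descent), the second
bound leads to $\frac{4L\|\theta_0 - \theta_*\|^2}{N} + \frac{8\operatorname{tr}(CH^{-1})}{N}$. We recover a result from Bach and Moulines
(2013, Theorem 1). Note that when $C \preccurlyeq \sigma^2 H$, $\operatorname{tr}(CH^{-1}) \leq \sigma^2 d$.

- For $\alpha = \beta = 1/L$ (which corresponds to Nesterov's acceleration), the first bound leads
to $\frac{L\|\theta_0 - \theta_*\|^2}{N^2} + \frac{(N+1)^2\operatorname{tr}(CH^{-1})}{N^2}$, which is bounded but not converging to zero (as opposed to
the unstructured noise where the algorithm may diverge).

- For $\alpha = 1/(LN^a)$ with $0 \leq a \leq 1$ and $\beta = 1/L$, the first bound leads to
$\frac{L\|\theta_0 - \theta_*\|^2}{N^{2-a}} + \frac{(N^{1-a}+1)^2\operatorname{tr}(CH^{-1})}{N^{2-a}}$, where the second term is at most $\frac{4\operatorname{tr}(CH^{-1})}{N^a}$.
We thus obtain an explicit bias-variance trade-off by changing the value of $a$.

- When the values of the bias and the variance are known we can choose $\alpha$ and $\beta$ with
an optimized trade-off, as the following corollary shows: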


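Corollary 3. For $\alpha = \min\left\{ \frac{\|\theta_0 - \theta_*\|}{\sqrt{L\operatorname{tr}(CH^{-1})}\,N}, \; \frac{1}{L} \right\}$ and $\beta = \min\{N\alpha, 1/L\}$ we have:

    $\mathbb{E} f(\theta_N) - f(\theta_*) \leq \max\left\{ \frac{5\operatorname{tr}(CH^{-1})}{N}, \; \frac{5\sqrt{\operatorname{tr}(CH^{-1})L}\,\|\theta_0 - \theta_*\|}{N}, \; \frac{2\|\theta_0 - \theta_*\|^2 L}{N^2} \right\}$.    (15)

4.4 Related work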


Acceleration and noisy gradients. Several authors (Lan, 2012; Hu et al., 2009; Xiao,
2010) have shown that using a step-size proportional to $1/N^{3/2}$, accelerated methods with
noisy gradients lead to the same convergence rate of $O\left( \frac{L\|\theta_0 - \theta_*\|^2}{N^2} + \frac{\|\theta_0 - \theta_*\|\sqrt{\operatorname{tr}(C)}}{\sqrt{N}} \right)$ as
in Corollary 2, for smooth functions. Thus, for unstructured noise, our analysis provides
insights into the behavior of second-order algorithms, without improving bounds. We get
significant improvements for structured noises.


Least-squares regression. When the noise is structured as in least-squares regression and
more generally in linear supervised learning, Bach and Moulines (2011) have shown that
using averaged stochastic gradient descent with constant step-size leads to the convergence
rate of $O\left( \frac{L\|\theta_0 - \theta_*\|^2}{N} + \frac{\sigma^2 d}{N} \right)$. It has been highlighted by Défossez and Bach (2014) that the bias
term $\frac{L\|\theta_0 - \theta_*\|^2}{N}$ may often be the dominant one in practice. Our result in Corollary 3 leads
to an improved bias term in $O(1/N^2)$ at the price of a potentially slightly worse constant
in the variance term. However, with optimal constants in Corollary 3, the new algorithm
is always an improvement over averaged stochastic gradient descent in all situations. If
constants are unknown, we may use $\alpha = 1/(LN^a)$ with $0 \leq a \leq 1$ and $\beta = 1/L$ and we
choose $a$ depending on the emphasis we want to put on bias or variance.


Minimax convergence rates. For noisy quadratic problems, the convergence rate nicely
decomposes into two terms, a bias term which corresponds to the noiseless problem and
the variance term which corresponds to a problem started at $\theta_*$. For each of these two
terms, lower bounds are known. For the bias term, if $N \leq d$, then the lower bound is,
up to constants, $L\|\theta_0 - \theta_*\|^2/N^2$ (Nesterov, 2004, Theorem 2.1.7). For the variance term,
for the general noisy gradient situation, we show in Appendix H that for $N \leq d$, it is
$(\operatorname{tr} C)/(L\sqrt{N})$, while for least-squares regression, it is $\sigma^2 d/N$ (Tsybakov, 2003). Thus, for
the two situations, we attain the two lower bounds simultaneously for situations where
respectively $L\|\theta_0 - \theta_*\|^2 \leq (\operatorname{tr} C)/L$ and $L\|\theta_0 - \theta_*\|^2 \leq d\sigma^2$. It remains an open problem
to achieve the two minimax terms in all situations.


Other algorithms as special cases. We also note as shown in Appendix G that in 
the special case of quadratic functions, the algorithms of Lan (2012); Hu et al. (2009); 
Xiao (2010) could be unified into our framework (although they have significantly different 
formulations and justifications in the smooth case). 


5 Experiments 

In this section, we illustrate our theoretical results on synthetic examples. We consider
a matrix $H$ that has random eigenvectors and eigenvalues $1/k^m$, for $k = 1, \dots, d$ and
$m \in \mathbb{N}$. We take a random optimum $\theta_*$ and a random starting point $\theta_0$ such that $r =
\|\theta_0 - \theta_*\| = 1$ (unless otherwise specified). In Appendix A, we illustrate the noiseless results
of Section 3, in particular the oscillatory behaviors and the influence of all eigenvalues,
as well as unstructured noisy gradients. In this section, we focus on noisy gradients with
structured noise (as described in Section 4.3), where our new algorithms show significant
improvements.

Figure 3: Quadratic optimization with regression noise (both panels titled "Structured noisy
gradients, d=20"). Left $\sigma = 1$, $r = 1$. Right $\sigma = 0.1$, $r = 10$.

We compare our algorithm to other stochastic accelerated algorithms, that is, AC-SA (Lan,
2012), SAGE (Hu et al., 2009) and Acc-RDA (Xiao, 2010) which are presented in Appendix G.
For all these algorithms (and ours) we take the optimal step-sizes defined in
these papers. We show results averaged over 10 replications.


Homoscedastic noise. We first consider an i.i.d. zero mean noise whose covariance matrix
is proportional to $H$. We also consider a variant of our algorithm with an any-time
step-size function of $n$ rather than $N$ (for which we currently have no proof of convergence).
In Figure 3, we take into account two different set-ups. In the left plot, the variance dominates
the bias (with $r = \|\theta_0 - \theta_*\| = \sigma$). We see that (a) Acc-GD does not converge to the
optimum but does not diverge either, (b) Av-GD and our algorithms achieve the optimal
rate of convergence of $O(\sigma^2 d/n)$, whereas (c) other accelerated algorithms only converge at
rate $O(1/\sqrt{n})$. In the right plot, the bias dominates the variance ($r = 10$ and $\sigma = 0.1$). In
this situation our algorithm outperforms all others.


Application to least-squares regression. We now see how these algorithms behave for
least-squares regressions and the regular (non-homoscedastic) stochastic gradients described
in Section 4.3. We consider normally distributed inputs. The covariance matrix $H$ is the
same as before. The outputs are generated from a linear function with homoscedastic noise
with a signal-to-noise ratio of $\sigma$. We consider $d = 20$. We show results averaged over 10
replications. In Figure 4, we consider again a situation where the bias dominates (left)
and vice versa (right). We see that our algorithm has the same good behavior as in the
homoscedastic noise case and we conjecture that our bounds also hold in this situation.

Figure 4: Least-Square Regression (both panels titled "Least-Square Regression, d=20").
Left $\sigma = 1$, $r = 1$. Right $\sigma = 0.1$, $r = 10$.


6 Conclusion 

We have provided a joint analysis of averaging and acceleration for non-strongly-convex
quadratic functions in a single framework, both with noiseless and noisy gradients. This
allows us to define a class of algorithms that can benefit simultaneously from the known
improvements of averaging and acceleration: faster forgetting of initial conditions (for acceleration),
and better robustness to noise when the noise covariance is proportional to the Hessian (for
averaging).

Our current analysis of our class of algorithms in Eq. (4), that considers two different affine 
combinations of previous iterates (instead of one for traditional acceleration), is limited 
to quadratic functions; an extension of its analysis to all smooth or self-concordant-like 
functions would widen its applicability. Similarly, an extension to least-squares regression 
with natural heteroscedastic stochastic gradient, as suggested by our simulations, would be 
an interesting development.

A Additional experimental results

In this appendix, we provide additional experimental results to illustrate our theoretical 
results. 

A.1 Deterministic convergence

Comparison for d = 1. In Figure 5, we minimize a one-dimensional quadratic function
$f(\theta) = \frac{1}{2}\theta^2$ for a fixed step-size $\alpha = 1/10$ and different step-sizes $\beta$. In the left plot, we
compare Acc-GD, HB and Av-GD. We see that HB and Acc-GD both oscillate and that
Acc-GD leverages strong convexity to converge faster. In the right plot, we compare the
behavior of the algorithm for different values of $\beta$. We see that the optimal rate is achieved
for $\beta = \beta^*$, defined to be the one for which there is a double coalescent eigenvalue, where
the convergence is linear at speed $O\big((1 - \sqrt{\alpha L})^n\big)$. When $\beta > \beta^*$, we are in the real case and
when $\beta < \beta^*$ the algorithm oscillates to the solution.


Comparison between the different eigenspaces. Figure 6 shows interactions between
different eigenspaces. In the left plot, we optimize a quadratic function of dimension d = 2.
The first eigenvalue is $L = 1$ and the second is $\mu = 2^{-8}$. For Av-GD the convergence is
of order $O(1/n)$ since the problem is "not" strongly convex (i.e., not appearing as strongly
convex since $n\mu$ remains small). The convergence is at the beginning the same for HB
and Acc-GD, with oscillation at speed $O(1/n^2)$, since the small eigenvalue prevents Acc-GD
from having a linear convergence. Then for large $n$, the convergence becomes linear
for Acc-GD, since $\mu n$ becomes large. In the right plot, we optimize a quadratic function
in dimension d = 5 with eigenvalues from 1 to 0.1. We show the function values of the
projections of the iterates $\eta_n$ on the different eigenspaces. We see that high eigenvalues first
dominate, but converge quickly to zero, whereas small ones keep oscillating, and converge
more slowly.
[Figure 5: panels titled "Minimization of $f(\theta)=\theta^2/2$".]

Figure 5: Deterministic case for d = 1 and $\alpha = 1/10$. Left: Acc-GD, HB and Av-GD. Right:
different oscillatory behaviors.

Figure 6: Left: Deterministic quadratic optimization for d = 2. Right: Function value of
the projection of the iterate on the different eigenspaces (d = 5).

[Figure 7: panels titled "d=20, $h_i = i^{-2}$" and "d=20, $h_i = i^{-8}$".]

Figure 7: Deterministic case for d = 20.

Comparison for d = 20. In Figure 7, we optimize two 20-dimensional quadratic functions
with different eigenvalues with Av-GD, HB and Acc-GD for a fixed step-size $\gamma = 1/10$. In
the left plot, the eigenvalues are $1/k^2$ and in the right one, they are $1/k^8$, for $k = 1, \dots, d$.
We see that in both cases, Av-GD converges at a rate of $O(1/n)$ and HB at a rate of
$O(1/n^2)$. For Acc-GD the convergence is linear when $\mu$ is large (left plot) and becomes
sublinear at a rate of $O(1/n^2)$ when $\mu$ becomes small (right plot).

A.2 Noisy convergence with unstructured additive noise 

We optimize the same quadratic function, but now with noisy gradients. We compare our
algorithm to other stochastic accelerated algorithms, that is, AC-SA (Lan, 2012), SAGE
(Hu et al., 2009) and Acc-RDA (Xiao, 2010), which are presented in Appendix G. For all
these algorithms (and ours) we take the optimal step-sizes defined in these papers. We plot
the results averaged over 10 replications.

We consider in Figure 8 an i.i.d. zero-mean noise of variance $C = I$. We see that all the
accelerated algorithms achieve the same precision whereas Av-GD with constant step-size
does not converge and Acc-GD diverges. However SAGE and AC-SA are anytime algorithms
and are faster at the beginning since their step-sizes are decreasing and not a constant (with
respect to $n$) function of the horizon $N$.


[Figure 8: panel titled "Noisy gradients, d=20, $h_i = i^{-2}$".]



B Proofs of Section 2 

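B.1 Proof of Theorem 1

Let $(P_n, Q_n, R_n) \in (\mathbb{R}[X])^3$ for all $n \in \mathbb{N}$ be a sequence of polynomials. We consider the
iterates defined for all $n \in \mathbb{N}^*$ by

$$\theta_{n+1} = P_n(H)\theta_n + Q_n(H)\theta_{n-1} + R_n(H)q,$$

started from $\theta_0 = \theta_1 \in \mathbb{R}^d$. The $\theta_*$-stationarity property gives for $n \in \mathbb{N}^*$:

$$\theta_* = P_n(H)\theta_* + Q_n(H)\theta_* + R_n(H)q.$$

Since $\theta_* = H^{-1}q$ we get for all $q \in \mathbb{R}^d$

$$H^{-1}q = P_n(H)H^{-1}q + Q_n(H)H^{-1}q + R_n(H)q.$$

For all $\tilde q \in \mathbb{R}^d$ we apply this relation to vectors $q = H\tilde q$:

$$\tilde q = P_n(H)\tilde q + Q_n(H)\tilde q + R_n(H)H\tilde q \quad \forall \tilde q \in \mathbb{R}^d,$$

and we get

$$I = P_n(H) + Q_n(H) + R_n(H)H \quad \forall n \in \mathbb{N}^*.$$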

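Therefore there are polynomials $(\bar P_n, \bar Q_n) \in (\mathbb{R}[X])^2$ and $q_n \in \mathbb{R}$ for all $n \in \mathbb{N}^*$ such that we
have for all $n \in \mathbb{N}$:

$$
\begin{aligned}
P_n(X) &= (1 - q_n)I + X\bar P_n(X), \\
Q_n(X) &= q_n I + X\bar Q_n(X), \\
R_n(X) &= -\big(\bar P_n(X) + \bar Q_n(X)\big). \qquad (16)
\end{aligned}
$$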

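The $n$-scalability property means that there are polynomials $(P, Q) \in (\mathbb{R}[X])^2$ independent
of $n$ such that:

$$P_n(X) = \frac{n}{n+1}P(X), \qquad Q_n(X) = \frac{n-1}{n+1}Q(X).$$

And in connection with Eq. (16) we can rewrite $P$ and $Q$ as:

$$P(X) = p + X\bar P(X), \qquad Q(X) = q + X\bar Q(X),$$

with $(p, q) \in \mathbb{R}^2$ and $(\bar P, \bar Q) \in (\mathbb{R}[X])^2$. Thus for all $n \in \mathbb{N}$:

$$q_n = \frac{n-1}{n+1}\,q, \qquad \bar Q_n(X) = \frac{n-1}{n+1}\,\bar Q(X), \qquad (17)$$

$$\frac{n}{n+1}\,p = 1 - q_n, \qquad \bar P_n(X) = \frac{n}{n+1}\,\bar P(X). \qquad (18)$$

Eq. (17) and Eq. (18) give:

$$\frac{n}{n+1}\,p = 1 - \frac{n-1}{n+1}\,q.$$

Thus for $n = 1$, we have $p = 2$. Then $-\frac{n-1}{n+1}q = \frac{2n}{n+1} - 1 = \frac{n-1}{n+1}$ and $q = -1$. Therefore

$$
\begin{aligned}
P_n(H) &= \frac{2n}{n+1}I + \frac{n}{n+1}\bar P(H)H, \\
Q_n(H) &= -\frac{n-1}{n+1}I + \frac{n-1}{n+1}\bar Q(H)H, \\
R_n(H) &= -\frac{n\bar P(H) + (n-1)\bar Q(H)}{n+1}.
\end{aligned}
$$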
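We let $\bar A = -(\bar P + \bar Q)$ and $\bar B = \bar Q$ so that we have:

$$
\begin{aligned}
P_n(H) &= \frac{2n}{n+1}\Big[I - \frac{\bar A(H) + \bar B(H)}{2}H\Big], \\
Q_n(H) &= -\frac{n-1}{n+1}\big[I - \bar B(H)H\big], \\
R_n(H) &= \frac{n\bar A(H) + \bar B(H)}{n+1},
\end{aligned}
$$

and with $\phi_n = \theta_n - \theta_*$ for all $n \in \mathbb{N}$, the algorithm can be written under the form:

$$\phi_{n+1} = \Big[I - \Big(\frac{n}{n+1}\bar A(H) + \frac{1}{n+1}\bar B(H)\Big)H\Big]\phi_n + \Big(1 - \frac{2}{n+1}\Big)\big[I - \bar B(H)H\big](\phi_n - \phi_{n-1}).$$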


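B.2 Av-GD as a two-step algorithm

We now show that, when the averaged iterate of Av-GD is seen as the main iterate,
Av-GD with step-size $\gamma \in \mathbb{R}$ is equivalent to:

$$\theta_{n+1} = \frac{2n}{n+1}\theta_n - \frac{n-1}{n+1}\theta_{n-1} - \frac{\gamma}{n+1}f'\big(n\theta_n - (n-1)\theta_{n-1}\big).$$

We recall

$$\psi_{n+1} = \psi_n - \gamma f'(\psi_n), \qquad \theta_{n+1} = \theta_n + \frac{1}{n+1}(\psi_{n+1} - \theta_n).$$

The averaging step at index $n$ gives $\psi_n = \theta_n + (n-1)(\theta_n - \theta_{n-1})$. Thus, we have:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \frac{1}{n+1}(\psi_{n+1} - \theta_n) \\
&= \theta_n + \frac{1}{n+1}\big(\psi_n - \gamma f'(\psi_n) - \theta_n\big) \\
&= \theta_n + \frac{1}{n+1}\Big(\theta_n + (n-1)(\theta_n - \theta_{n-1}) - \gamma f'\big(\theta_n + (n-1)(\theta_n - \theta_{n-1})\big) - \theta_n\Big) \\
&= \frac{2n}{n+1}\theta_n - \frac{n-1}{n+1}\theta_{n-1} - \frac{\gamma}{n+1}f'\big(n\theta_n - (n-1)\theta_{n-1}\big).
\end{aligned}
$$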
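The equivalence between averaged gradient descent and its two-step form can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the one-dimensional quadratic $f(\theta) = h\theta^2/2$ with $h = 0.5$, and the indexing convention (the average starts at $\theta_0 = \psi_0$, and the two-step recursion is started from the first two averaged iterates) are all assumptions made for illustration.

```python
def grad(theta, h=0.5):
    # Gradient of the illustrative quadratic f(theta) = h * theta**2 / 2 (assumed example).
    return h * theta


def av_gd(theta0, gamma, n_steps):
    # Plain gradient descent on psi, with a running average theta of the psi iterates.
    psi, theta = theta0, theta0
    out = [theta]
    for n in range(n_steps):
        psi = psi - gamma * grad(psi)              # psi_{n+1}
        theta = theta + (psi - theta) / (n + 1)    # theta_{n+1}
        out.append(theta)
    return out


def av_gd_two_step(theta0, theta1, gamma, n_steps):
    # Two-step recursion on the averaged iterate only, for n >= 1.
    out = [theta0, theta1]
    for n in range(1, n_steps):
        prev, cur = out[n - 1], out[n]
        query = n * cur - (n - 1) * prev           # recovers psi_n from the averages
        out.append((2 * n * cur - (n - 1) * prev - gamma * grad(query)) / (n + 1))
    return out


reference = av_gd(1.0, 0.1, 50)
two_step = av_gd_two_step(reference[0], reference[1], 0.1, 50)
max_gap = max(abs(a - b) for a, b in zip(reference, two_step))
```

Under these assumptions `max_gap` stays at the level of floating-point rounding, which is what the derivation predicts.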


C Proof of Section 3 


C.1 Proof of Lemma 1

The discriminant $\Delta_i$ is strictly positive when $\big(\frac{\alpha+\beta}{2}\big)^2 h_i - \alpha > 0$. This is always true for $\alpha$
strictly negative. For $\alpha$ positive and for $h_i \neq 0$, this is true for $\big|\frac{\alpha+\beta}{2}\big| > \sqrt{\alpha/h_i}$. Thus the
discriminant $\Delta_i$ is strictly positive for

$$\alpha < 0 \quad \text{or} \quad \Big\{\alpha \geq 0 \ \text{and} \ \big(\beta < -\alpha - 2\sqrt{\alpha/h_i} \ \text{or} \ \beta > -\alpha + 2\sqrt{\alpha/h_i}\big)\Big\}.$$

Figure 9: Stability in the real case, with all constraints plotted.


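Then we determine when the modulus of the eigenvalues is less than one (which corresponds
to $-1 < r_i^- \leq r_i^+ < 1$). We have:

$$
\begin{aligned}
r_i^+ < 1 &\iff \sqrt{\Big(\frac{\alpha+\beta}{2}\Big)^2 h_i^2 - \alpha h_i} < \frac{\alpha+\beta}{2}h_i \\
&\iff \Big(\frac{\alpha+\beta}{2}\Big)^2 h_i^2 - \alpha h_i < \Big(\frac{\alpha+\beta}{2}\Big)^2 h_i^2 \ \text{and} \ \frac{\alpha+\beta}{2}h_i > 0 \\
&\iff h_i\alpha > 0 \ \text{and} \ \frac{\alpha+\beta}{2} > 0 \\
&\iff \alpha > 0 \ \text{and} \ \alpha + \beta > 0.
\end{aligned}
$$

Moreover, we have:

$$
\begin{aligned}
r_i^- > -1 &\iff \sqrt{\Big(\frac{\alpha+\beta}{2}\Big)^2 h_i^2 - \alpha h_i} < 2 - \frac{\alpha+\beta}{2}h_i \\
&\iff \Big(\frac{\alpha+\beta}{2}\Big)^2 h_i^2 - \alpha h_i < \Big(2 - \frac{\alpha+\beta}{2}h_i\Big)^2 \ \text{and} \ 2 - \frac{\alpha+\beta}{2}h_i > 0 \\
&\iff -h_i\alpha < 4 - 4\,\frac{\alpha+\beta}{2}h_i \ \text{and} \ \frac{\alpha+\beta}{2} < \frac{2}{h_i} \\
&\iff \beta < 2/h_i - \alpha/2 \ \text{and} \ \beta < 4/h_i - \alpha.
\end{aligned}
$$

Figure 9 (where we plot all the constraints we have so far) enables us to conclude that the
discriminant $\Delta_i$ is strictly positive and the algorithm is stable when the following three
conditions are satisfied:

$$\alpha > 0, \qquad \alpha + 2\beta < 4/h_i, \qquad \alpha + \beta > 2\sqrt{\alpha/h_i}.$$

For any of those $\alpha$ and $\beta$ we will have:

$$\eta_n^i = c_1 (r_i^-)^n + c_2 (r_i^+)^n.$$

Since $\eta_0^i = 0$, $c_1 + c_2 = 0$ and for $n = 1$, $c_1 = \eta_1^i/(r_i^- - r_i^+)$; we thus have:

$$\eta_n^i = \eta_1^i\,\frac{(r_i^+)^n - (r_i^-)^n}{2\sqrt{\Delta_i}}.$$

Thus, since $\eta_n^i = n\phi_n^i$, we get the final expression:

$$(\phi_n^i)^2 h_i = \frac{(\phi_1^i)^2 h_i}{4n^2\Delta_i}\big[(r_i^+)^n - (r_i^-)^n\big]^2.$$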
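The closed form of the real case can be compared with a direct run of the scalar recursion $\eta_{n+1} = (1-\alpha h)\eta_n + (1-\beta h)(\eta_n - \eta_{n-1})$ started from $\eta_0 = 0$. This is a minimal sketch under stated assumptions: the function names and the parameter values ($\alpha = 0.1$, $\beta = 0.9$, $h = 1$, which satisfy the three stability conditions) are chosen only for illustration.

```python
import math


def real_case_roots(alpha, beta, h):
    # Roots of r^2 - (2 - (alpha + beta) h) r + (1 - beta h) = 0 when the discriminant is positive.
    r = 1.0 - (alpha + beta) * h / 2.0
    delta = ((alpha + beta) * h / 2.0) ** 2 - alpha * h
    if delta <= 0:
        raise ValueError("not in the real case: discriminant is not positive")
    return r - math.sqrt(delta), r + math.sqrt(delta), delta


def eta_by_recursion(alpha, beta, h, n_max, eta1=1.0):
    # Iterates eta_0, ..., eta_{n_max} of the second-order recursion with eta_0 = 0.
    etas = [0.0, eta1]
    for _ in range(n_max - 1):
        prev, cur = etas[-2], etas[-1]
        etas.append((1 - alpha * h) * cur + (1 - beta * h) * (cur - prev))
    return etas


def eta_closed_form(alpha, beta, h, n, eta1=1.0):
    # eta_n = eta_1 * (r_+^n - r_-^n) / (2 sqrt(Delta)).
    r_minus, r_plus, delta = real_case_roots(alpha, beta, h)
    return eta1 * (r_plus ** n - r_minus ** n) / (2.0 * math.sqrt(delta))


alpha, beta, h = 0.1, 0.9, 1.0
r_minus, r_plus, _ = real_case_roots(alpha, beta, h)
etas = eta_by_recursion(alpha, beta, h, 30)
gap = max(abs(etas[n] - eta_closed_form(alpha, beta, h, n)) for n in range(31))
```

Both roots lie in $(-1, 1)$ for this choice, and the recursion and the closed form agree up to rounding.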


C.2 Proof of Lemma 2 



Figure 10: Stability in the complex case, with all constraints plotted. 

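The discriminant $\Delta_i$ is strictly negative if and only if $\big(\frac{\alpha+\beta}{2}\big)^2 h_i - \alpha < 0$. This implies
$\big|\frac{\alpha+\beta}{2}\big| < \sqrt{\alpha/h_i}$. The modulus of the eigenvalues is $|r_i^\pm|^2 = 1 - \beta h_i$. Thus the discriminant
$\Delta_i$ is strictly negative and the algorithm is stable for

$$\alpha, \beta > 0, \qquad \alpha + \beta < 2\sqrt{\alpha/h_i},$$

as shown in Figure 10.

For any of those $\alpha$ and $\beta$ we have:

$$\eta_n^i = \big[c_1\cos(\omega_i n) + c_2\sin(\omega_i n)\big]\rho_i^n,$$

with $\rho_i = \sqrt{1 - \beta h_i}$, $\sin(\omega_i) = \sqrt{-\Delta_i}/\rho_i$ and $\cos(\omega_i) = r_i/\rho_i$. Since $\eta_0^i = 0$, $c_1 = 0$ and we
have for $n = 1$, $c_2 = \eta_1^i/(\sin(\omega_i)\rho_i)$. Therefore

$$\eta_n^i = \eta_1^i\,\frac{\sin(\omega_i n)}{\sqrt{-\Delta_i}}\,\rho_i^n,$$

and

$$(\phi_n^i)^2 h_i = (\phi_1^i)^2 h_i\,(1 - \beta h_i)^n\,\frac{\sin^2(\omega_i n)}{n^2(-\Delta_i)}.$$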
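The oscillatory closed form can be checked in the same way as the real case. The sketch below is illustrative only: the helper names and the values $\alpha = \beta = 0.1$, $h = 1$ (which give a negative discriminant) are assumptions, and the angle $\omega$ is obtained with `math.atan2` from $\sin\omega = \sqrt{-\Delta}/\rho$ and $\cos\omega = r/\rho$.

```python
import math


def eta_oscillatory(alpha, beta, h, n, eta1=1.0):
    # Closed form eta_n = eta_1 * sin(omega n) * rho^n / sqrt(-Delta) in the complex case.
    r = 1.0 - (alpha + beta) * h / 2.0
    delta = ((alpha + beta) * h / 2.0) ** 2 - alpha * h
    if delta >= 0:
        raise ValueError("not in the complex case: discriminant is not negative")
    rho = math.sqrt(1.0 - beta * h)
    omega = math.atan2(math.sqrt(-delta), r)
    return eta1 * math.sin(omega * n) * rho ** n / math.sqrt(-delta)


def eta_recursion(alpha, beta, h, n_max, eta1=1.0):
    # Iterates eta_0, ..., eta_{n_max} of the second-order recursion with eta_0 = 0.
    etas = [0.0, eta1]
    for _ in range(n_max - 1):
        prev, cur = etas[-2], etas[-1]
        etas.append((1 - alpha * h) * cur + (1 - beta * h) * (cur - prev))
    return etas


alpha, beta, h = 0.1, 0.1, 1.0
etas = eta_recursion(alpha, beta, h, 60)
gap = max(abs(etas[n] - eta_oscillatory(alpha, beta, h, n)) for n in range(61))
```

The iterates change sign along the run, which is the oscillation described in the complex case.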






C.3 Coalescing eigenvalues 


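When $\beta = 2\sqrt{\alpha/h_i} - \alpha$, the discriminant $\Delta_i$ is equal to zero and we have a double real
eigenvalue:

$$r_i = 1 - \sqrt{\alpha h_i}.$$

Thus the algorithm is stable for $\alpha < 4/h_i$. For any of those $\alpha$ and $\beta$ we have:

$$\eta_n^i = (c_1 + n c_2)\,r_i^n.$$

This gives with $\eta_0^i = 0$, $c_1 = 0$ and $c_2 = \eta_1^i/r_i$. Therefore

$$\eta_n^i = n\,\eta_1^i\,(1 - \sqrt{\alpha h_i})^{n-1},$$

and:

$$(\phi_n^i)^2 h_i = h_i\,(\phi_1^i)^2\,(1 - \sqrt{\alpha h_i})^{2(n-1)}.$$

In the presence of coalescing eigenvalues the convergence is linear if $0 < \alpha < 4/h_i$ and
$h_i > 0$; however, one might worry about the behavior of $((\phi_n^i)^2 h_i)_n$ when $h_i$ becomes small.
Using the bound $x^2\exp(-x) \leq 1$ for $x \geq 0$, we have for $\alpha < 4/h_i$:

$$
\begin{aligned}
h_i\big(1 - \sqrt{\alpha h_i}\big)^{2n} &= h_i\exp\big(2n\log|1 - \sqrt{\alpha h_i}|\big) \\
&\leq h_i\exp\big(-2n\min\{\sqrt{\alpha h_i},\, 2 - \sqrt{\alpha h_i}\}\big) \\
&\leq \frac{h_i}{4n^2\min\{\sqrt{\alpha h_i},\, 2 - \sqrt{\alpha h_i}\}^2} \\
&\leq \frac{1}{4n^2}\max\Big\{\frac{1}{\alpha},\, \frac{h_i}{(2 - \sqrt{\alpha h_i})^2}\Big\}.
\end{aligned}
$$

Therefore we always have the following bound for $\alpha < 4/h_i$:

$$(\phi_n^i)^2 h_i \leq \frac{(\phi_1^i)^2}{4n^2}\max\Big\{\frac{1}{\alpha},\, \frac{h_i}{(2 - \sqrt{\alpha h_i})^2}\Big\}.$$

Thus for $\alpha h_i \leq 1$ we get:

$$(\phi_n^i)^2 h_i \leq \frac{(\phi_1^i)^2}{4n^2\alpha}.$$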
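The chain of inequalities on $h_i(1 - \sqrt{\alpha h_i})^{2n}$ can be spot-checked on a grid of values of $h_i$ in $(0, 4/\alpha)$. This is a small sketch rather than a proof; the function names, the grid size and the tested values of $n$ and $\alpha$ are arbitrary illustrative choices.

```python
import math


def coalescing_lhs(alpha, h, n):
    # Left-hand side h * (1 - sqrt(alpha h))^(2n).
    return h * abs(1.0 - math.sqrt(alpha * h)) ** (2 * n)


def coalescing_rhs(alpha, h, n):
    # Right-hand side max{1/alpha, h / (2 - sqrt(alpha h))^2} / (4 n^2).
    s = math.sqrt(alpha * h)
    return max(1.0 / alpha, h / (2.0 - s) ** 2) / (4.0 * n * n)


def bound_holds_on_grid(alpha=1.0, n_values=(1, 2, 5, 10, 100), grid=200):
    # Check lhs <= rhs for h = (4 / alpha) * k / grid, k = 1, ..., grid - 1.
    for n in n_values:
        for k in range(1, grid):
            h = 4.0 / alpha * k / grid
            if coalescing_lhs(alpha, h, n) > coalescing_rhs(alpha, h, n) * (1 + 1e-12):
                return False
    return True
```

Calling `bound_holds_on_grid()` or `bound_holds_on_grid(alpha=0.3)` exercises both branches of the maximum, since the grid covers $\sqrt{\alpha h_i}$ on both sides of 1.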
D Proof of Theorem 2


D.1 Sketch of the proof

We divide the domain of validity of Theorem 2 into three subdomains, as explained in Figure
14. On the domain described in Figure 11 we have a first bound on the iterate $\eta_n^i$:

Lemma 3. For $0 \leq \alpha \leq 1/h_i$ and $1 - \sqrt{1 - \alpha h_i} \leq \beta h_i \leq 1 + \sqrt{1 - \alpha h_i}$, we have:

$$(\eta_n^i)^2 \leq \frac{(\eta_1^i)^2}{\alpha h_i}.$$

And on the domain described in Figure 12 we also have:

Lemma 4. For $0 \leq \alpha \leq 1/h_i$ and $\beta \leq \alpha$ we have:

$$(\eta_n^i)^2 \leq \frac{2(\eta_1^i)^2}{\alpha h_i}.$$

These two lemmas enable us to prove the first bound of Theorem 2 since the domain of
this theorem is included in the union of the two domains of these lemmas as shown
in Figure 14.

Then we have the following bound on the domain described in Figure 13:

Lemma 5. For $0 \leq \alpha \leq 2/h_i$ and $0 \leq \beta \leq 2/h_i - \alpha$, we have:

$$|\eta_n^i| \leq \min\Big\{\frac{2\sqrt{2n}}{\sqrt{(\alpha+\beta)h_i}},\, \frac{4}{(\alpha+\beta)h_i}\Big\}\,|\eta_1^i|.$$

Since the domain of definition of Theorem 2 is included in the domain of definition of Lemma
5 (as shown in Figure 14), this lemma proves the last two bounds of the theorem.


D.2 Outline of the proofs of the Lemmas 

- We find a Lyapunov function $G$ from $\mathbb{R}^2$ to $\mathbb{R}$ such that the sequence $(G(\eta_n, \eta_{n-1}))_n$
decreases along the iterates.

- We also prove that $G(\eta_n, \eta_{n-1})$ dominates $c\|\eta_n\|^2$ when we want to have a bound on
$\|\eta_n\|^2$ of the form $\frac{1}{c}G(\theta_1 - \theta_*, 0)$.

For readability, we remove the index $i$ and take $h_i = 1$ without loss of generality.


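D.3 Proof of Lemma 3

We first consider a quadratic Lyapunov function $\begin{pmatrix}\eta_n \\ \eta_{n-1}\end{pmatrix}^\top G_1 \begin{pmatrix}\eta_n \\ \eta_{n-1}\end{pmatrix}$ with $G_1 = \begin{pmatrix}1 & \alpha - 1 \\ \alpha - 1 & 1 - \alpha\end{pmatrix}$.
We note that $G_1$ is symmetric positive semi-definite for $0 \leq \alpha \leq 1$. We recall $F_i = \begin{pmatrix}2 - (\alpha+\beta) & \beta - 1 \\ 1 & 0\end{pmatrix}$ (with $h_i = 1$).
For the result to be true we need, for $0 \leq \alpha \leq 1$ and $1 - \sqrt{1 - \alpha} \leq \beta \leq 1 + \sqrt{1 - \alpha}$, two
properties:

$$F_i^\top G_1 F_i \preceq G_1, \qquad (19)$$

and

$$\alpha\begin{pmatrix}1 & 0 \\ 0 & 0\end{pmatrix} \preceq G_1. \qquad (20)$$

Proof of Eq. (20). We have:

$$G_1 - \alpha\begin{pmatrix}1 & 0 \\ 0 & 0\end{pmatrix} = (1 - \alpha)\begin{pmatrix}1 & -1 \\ -1 & 1\end{pmatrix} \succeq 0 \quad \text{for } \alpha \leq 1.$$

Proof of Eq. (19). Since $\beta \mapsto F_i(\beta)^\top G_1 F_i(\beta) - G_1$ is convex in $\beta$ ($G_1$ is symmetric
positive semi-definite), we only have to show Eq. (19) for the boundaries of the interval in
$\beta$. For $x \in \mathbb{R}$:

$$\begin{pmatrix}x^2 - x & x \\ 1 & 0\end{pmatrix}^\top \begin{pmatrix}1 & -x^2 \\ -x^2 & x^2\end{pmatrix} \begin{pmatrix}x^2 - x & x \\ 1 & 0\end{pmatrix} - \begin{pmatrix}1 & -x^2 \\ -x^2 & x^2\end{pmatrix} = -(1 - x^2)^2\begin{pmatrix}1 & 0 \\ 0 & 0\end{pmatrix} \preceq 0.$$

This especially shows Eq. (19) for the boundaries of the interval with $x = \pm\sqrt{1 - \alpha}$.

Bound. Thus, because $\eta_0 = 0$, we have, with $\Theta_n = (\eta_{n+1}, \eta_n)^\top$,

$$\alpha\,\eta_{n+1}^2 \leq \Theta_n^\top G_1\Theta_n \leq \Theta_{n-1}^\top G_1\Theta_{n-1} \leq \cdots \leq \Theta_0^\top G_1\Theta_0 \leq \eta_1^2.$$

This shows that for $0 \leq \alpha \leq 1/h_i$ and $1 - \sqrt{1 - \alpha h_i} \leq \beta h_i \leq 1 + \sqrt{1 - \alpha h_i}$:

$$(\eta_n^i)^2 \leq \frac{(\eta_1^i)^2}{\alpha h_i}.$$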
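The two matrix facts used for Lemma 3 can be verified numerically for $h_i = 1$. This is a minimal sketch with hypothetical helper names, using plain nested lists for the $2\times 2$ matrices. It checks that $F^\top G_1 F - G_1$ equals $-(1-x^2)^2\,\mathrm{diag}(1, 0)$ on the boundary $\beta = 1 + x$, $\alpha = 1 - x^2$, and that it is negative semi-definite for a value of $\beta$ inside the interval.

```python
def matmul(a, b):
    # Product of two 2x2 matrices stored as nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def lyapunov_gap(alpha, beta):
    # Returns F^T G1 F - G1 for h = 1, with F the companion matrix of the recursion.
    f = [[2.0 - (alpha + beta), beta - 1.0], [1.0, 0.0]]
    g = [[1.0, alpha - 1.0], [alpha - 1.0, 1.0 - alpha]]
    fgf = matmul(transpose(f), matmul(g, f))
    return [[fgf[i][j] - g[i][j] for j in range(2)] for i in range(2)]


def is_negative_semidefinite(m, tol=1e-12):
    # A symmetric 2x2 matrix is negative semi-definite iff trace <= 0 and determinant >= 0.
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return trace <= tol and det >= -tol


boundary_gap = lyapunov_gap(0.75, 1.5)   # x = 0.5, so alpha = 1 - x^2 and beta = 1 + x
interior_ok = is_negative_semidefinite(lyapunov_gap(0.75, 1.0))
```

With $x = 0.5$, the only nonzero entry of `boundary_gap` is the top-left one, equal to $-(1 - 0.25)^2$.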


D.4 Proof of Lemma 4 

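We consider now a second Lyapunov function $G_2(\eta_n, \eta_{n-1}) = (\eta_n - r\eta_{n-1})^2 - \Delta(\eta_{n-1})^2$.
We have:

$$
\begin{aligned}
G_2(\eta_n, \eta_{n-1}) &= (\eta_n - r\eta_{n-1})^2 - \Delta(\eta_{n-1})^2 \\
&= \big(r\eta_{n-1} - (1-\beta)\eta_{n-2}\big)^2 - \Delta\,\eta_{n-1}^2 \\
&= (r^2 - \Delta)\,\eta_{n-1}^2 + (1-\beta)^2\eta_{n-2}^2 - 2(1-\beta)\,r\,\eta_{n-1}\eta_{n-2} \\
&= (1-\beta)\,\eta_{n-1}^2 + (1-\beta)(r^2 - \Delta)\,\eta_{n-2}^2 - 2(1-\beta)\,r\,\eta_{n-1}\eta_{n-2} \\
&= (1-\beta)\big[(\eta_{n-1} - r\eta_{n-2})^2 - \Delta(\eta_{n-2})^2\big] \\
&= (1-\beta)\,G_2(\eta_{n-1}, \eta_{n-2}),
\end{aligned}
$$

where we have used twice $r^2 - \Delta = 1 - \beta$ and $\eta_n = 2r\eta_{n-1} - (1-\beta)\eta_{n-2}$. Moreover
$G_2(\eta_n, \eta_{n-1})$ can be rewritten as:

$$G_2(\eta_n, \eta_{n-1}) = \Big(1 - \frac{\alpha+\beta}{2}\Big)(\eta_n - \eta_{n-1})^2 + \frac{\alpha-\beta}{2}(\eta_{n-1})^2 + \frac{\alpha+\beta}{2}(\eta_n)^2.$$

Thus for $\alpha + \beta \leq 2$ and $\beta \leq \alpha$ we have:

$$\frac{\alpha+\beta}{2}(\eta_n)^2 \leq G_2(\eta_n, \eta_{n-1}) = (1-\beta)^{n-1}G_2(\eta_1, \eta_0) = (1-\beta)^{n-1}(\eta_1)^2.$$

Therefore for $\alpha + \beta \leq 2/h_i$ and $\beta \leq \alpha$, we have:

$$(\eta_n^i)^2 \leq \frac{2(\eta_1^i)^2}{\alpha h_i}.$$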
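The contraction $G_2(\eta_{n+1}, \eta_n) = (1-\beta)\,G_2(\eta_n, \eta_{n-1})$ and the resulting bound on $\eta_n$ can be checked along a run of the recursion with $h_i = 1$. The sketch below is illustrative: the function names and the choice $\alpha = 0.5$, $\beta = 0.2$ (so that $\beta \leq \alpha$ and $\alpha + \beta \leq 2$) are assumptions.

```python
def g2(eta_n, eta_prev, alpha, beta):
    # Second Lyapunov function (eta_n - r eta_{n-1})^2 - Delta * eta_{n-1}^2, for h = 1.
    r = 1.0 - (alpha + beta) / 2.0
    delta = ((alpha + beta) / 2.0) ** 2 - alpha
    return (eta_n - r * eta_prev) ** 2 - delta * eta_prev ** 2


def contraction_errors(alpha, beta, n_max=40, eta1=1.0):
    # Returns (largest deviation from the G2 contraction, whether the bound on eta held).
    prev, cur = 0.0, eta1
    worst = 0.0
    bound_ok = True
    for n in range(1, n_max):
        nxt = (1 - alpha) * cur + (1 - beta) * (cur - prev)      # eta_{n+1}
        lhs = g2(nxt, cur, alpha, beta)
        rhs = (1 - beta) * g2(cur, prev, alpha, beta)
        worst = max(worst, abs(lhs - rhs))
        # (alpha + beta) / 2 * eta_{n+1}^2 <= (1 - beta)^n * eta_1^2
        if (alpha + beta) / 2 * nxt ** 2 > (1 - beta) ** n * eta1 ** 2 + 1e-12:
            bound_ok = False
        prev, cur = cur, nxt
    return worst, bound_ok


worst_deviation, bound_holds = contraction_errors(0.5, 0.2)
```

The deviation stays at rounding level, and the bound holds at every step for this choice of step-sizes.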


D.5 Proof of Lemma 5 

We may write ry„ as 

rjn = rr]n-i + (r+)’^ + (r_)”. 


Moreover, we have: 


|(r+r + (r_n<2. 


therefore for a + /? < 2, 


|?/n| < r\T]n-l\ + 2 <2 


1 -r" 
1 — r 


< 2 


( 2 ^) 
Thus


2 




Moreover for all $u \in [0,1]$ and $n \geq 1$ we have $1 - (1-u)^n \leq \sqrt{nu}$, since $1 - (1-u)^n \leq 1$
and $1 - (1-u)^n = u\sum_{k=0}^{n-1}(1-u)^k \leq nu$. Thus





Therefore for Q < a <2lhi and a + j3 < 2/hi we have: 



E Lower bound 


We have the following lower-bound for the bound shown in Corollary 1, which shows that 


depending on which of the two terms dominates, we may always find a sequence of functions 
that makes it tight. 

Proposition 1. Let $L > 0$. For all sequences $0 \leq \alpha_n \leq 1/L$ and $0 \leq \beta_n \leq 2/L - \alpha_n$, such
that $\alpha_n + \beta_n = o(n\alpha_n)$, there exists a sequence of one-dimensional quadratic functions $(f_n)_n$
with second-derivative less than $L$ such that:



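For all sequences $0 \leq \alpha_n \leq 1/L$ and $0 \leq \beta_n \leq 2/L - \alpha_n$, such that $n\alpha_n = o(\alpha_n + \beta_n)$, there
exists a sequence of one-dimensional quadratic functions $(g_n)_n$ with second-derivative less
than $L$ such that:

$$\lim_{n} n(\alpha_n + \beta_n)\big(g_n(\theta_n) - g_n(\theta_*)\big) = \frac{(1 - \exp(-2))^2}{4}\,\|\theta_0 - \theta_*\|^2.$$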


Proof of the first lower-bound. For the first lower bound we consider 0 < < 1/T 


and 0 < f3n < 2/L — a, such that On + /3n = o(na„). We define /„ = 7r^/(4a„n^) and we 
consider the sequence of quadratic functions fn{0) = 2 ~’ consider the iterate {gn)n 

defined by our algorithm. We will show that 



We have, from Lemma 2, 


vlfn _ gjsm‘^{uJnn)pi 


.2n 


n 
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Tr^jan+hnF \ ' 
(4a7np ) 













Moreover 


2n I 1 V f t f 1 \ i , n^ 

Pn=[^--, -O =exp nlog 1---^ =l + o(l), 


4a„n2 J \ \ ianu"^, 

since ^ = o(l). Also, 1 - "" (Aa^n)^'^ = 1 + o(l), since «„ + /3n = o(na„). Moreover 


sin(a;„) = 






Pn \/l f^nfn 

thus tdn = tt/( 2n) + o(l/n) and sin(na;n) = 1 + o(l). 


= TTj{2n) + o(l/n), 


Proof of the second lower-bound. We consider now the situation where the second 
bound is active. Thus we take sequences (on) and (/3n), such that nan = o{an + Pn)- We 
define gn = ^ 

We will show for the iterate (ry„) defined by our algorithm that: 


~ n{a +/3 ) (q, )2 and consider the sequence of quadratic functions gniP) = 

1 ■ 

(1 - exp(-2))^ ||6»o - 


limn(a„ + Pn){gn{0n) - gn{6*)) = 
We will use Lemma 1. We first have 


An = 


Oin + Pn\ 2 


Qn ^nQn — Qn 


“1“ pn \ 1 


n 


Thus {nAn)/gn = 


- ' ) and 


\/^ = 


i| + 


2a. 


n J n{an + Pn) 


II 2o„„ 


n V an + Pn 

1 On ( On 

- H-+ o 


Moreover 


Thus 


rn = l- 


n an T Pn V an T P\ 


an T Pn 1 2an 

--- gn = 1- 


n an T Pn 


r+ = 1-+ o 


and 


Q^n+/3n Pn J 

r\ = exp(nlog(r+)) = exp + o( ^ ) = ^ + '^(1)- 


Furthermore 


H“ pn J V P'‘ 


2 3 (^ 77 , / 0^77 

V- = 1 -^—h o 


^ ^n+Pn P‘f^ J 
and


/ 3Q!n,n ^ / na 

r_ = exp(nlog(r+)) = exp ( —2-^| + o 


Oin “1“ Pr, 


Pri 


= exp(-2) + o(l). 


Thus 

Finally, we have: 


K-rl!)2 = (l-exp(-2)f+ 0(1). 


TL ^ni2 


{an + f3n)n[gn{en)-gn{e.)] = - 6 


2n 

1100-0 


4A^/ g^^ 


+1 ^rl-rlf 


||0O — 0*|P /, / r)',N2 , /I X 

--(l-exp(-2)) +o(l). 


F Proofs of Section 4 


F.l Proofs of Theorem 3 and Theorem 4 


We decompose again vectors in an eigenvector basis of H with = pj gn and = pj Sr, 


Pn+i = (1 - ahi)gl, + (1 - (3hi){g\, - gl,_^) + {na + /3)e(,+i. 


We denote by = 


_ /[na + /3]e(,+i 


and we have the reduced equation: 


Q\+,=F,Q\ + en+i- 

Unfortunately Fi is not Hermitian and this formulation will not be convenient for calculus. 
Without loss of generality, we assume r~ ^ rf even if it means having r~ — rf goes to 0 in 

I 1 ) transfer matrix of Fi, i.e., Fi = QiDiQ~^ with 

Di = and Q~^ = ^ ^ . We can reparametrize the problem in the 

following way: 

Q“'0Ui = Q^^FS'n + Q-^iUi 

= Q-^F,Q,Q-^Q\ + Q-^en^, 

= A(gr'0n) + Qr'en+i- 

With 0^ = and now have: 

©1+1 = A0j. + C+i, (21) 

with now Di Hermitian (even diagonal). 
Thus it is easier to tackle using standard techniques for stochastic approximation (see, e.g.,
Polyak and Juditsky, 1992; Bach and Moulines, 2011): 

n 

k=l 


Let Mj = 




we then get using standard martingale square moment inequalities, 


0 0 

since for $n \neq m$, $\varepsilon_n^i$ and $\varepsilon_m^i$ are uncorrelated (i.e., $\mathbb{E}[\varepsilon_n^i\varepsilon_m^i] = 0$):




n—k?i ||2 


k=l 


This is a bias-variance decomposition; the left term only depends on the initial condition 
and the right term only depends on the noise process. 

We have with Mi = , MiQ-^ = , and MSi = . Thus, 

we have access to the function values through: 


Moreover we have 0 q = 


( (A/ifi -rt)\ 

\-(t>\/{ri -rf)} 


. Thus 




||M,A"0of = {4>\?h. 


{{rtT-{r-r) 

(^+ _ )2 


This is the bias term we have studied in Section 3.3 which we bound with Theorem 2. The 
variance term is controlled by the next proposition. 

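Proposition 2. With $\mathbb{E}[(\varepsilon_n^i)^2] = c_i$ for all $n \in \mathbb{N}$, for $\alpha \leq 1/h_i$ and $0 \leq \beta \leq 2/h_i - \alpha$, we
have

$$\frac{1}{n^2}\sum_{k=1}^n \mathbb{E}\big\|M_i D_i^{n-k}\varepsilon_k^i\big\|^2 \leq \min\Big\{\frac{2(\alpha n + \beta)^2}{\alpha\beta\,(4 - (\alpha + 2\beta)h_i)\,n^2}\frac{c_i}{h_i},\ \frac{16(n\alpha + \beta)^2}{n(\alpha+\beta)^2}\frac{c_i}{h_i},\ \frac{(\alpha n + \beta)^2}{n\alpha}c_i,\ \frac{8(n\alpha + \beta)^2}{\alpha + \beta}c_i\Big\}.$$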


The last two bounds prove Theorem 3.

We note that if we restrict $\beta$ to $\beta \leq 3/(2h_i) - \alpha/2$, then $4 - (\alpha + 2\beta)h_i \geq 1$ and the
first bound of Proposition 2 is simplified to $\frac{2(\alpha n + \beta)^2}{\alpha\beta n^2}\frac{c_i}{h_i}$. This allows us to conclude the proof of
Theorem 4.
F.2 Proof of Corollary 3


We let v = 


\\eo-e. 


and consider three different regimes depending on v and L. 


y/LtT{CH- 

If < 1/L, we have u/N < 1/L and thus a = v/N and (5 = v. Therefore 




A^2 


a 


a 


■I3N‘^ 


\\0Q-e,f ^ AiY{CH-^) 


< 


< 


uN N 

y/Ltv{CH-^)\\eo-e, 


N 


+ 


4tr(CF-i) 


5ti{CH 

w 


-1^ 


where we have used y/L\\9o — ^*|| < Y^tr(C-ff since u < 1/L. 

li V > 1/L and v < N/L, we have a = v/N and (3 = 1/L. Therefore 

(oiV^ 1 \\9o-d4^ , 4tr(Cg 

N'^a a/3N‘^ ^ “ vN LvN 


-1^ 


< 


< 


+ 


.yLtT{CH-^)\\9o-94 , 4tr(CF-i) 
N 

54LtT{CH-^)\\9o-9^ 


N 


N 


where we have used y/L\\9o — 0*|| > y^tr(C-ff since v > 1/L. 
11 v>N/L,wq have a = 1/L and (5 = 1/L. Therefore 

L\\9o-94^ 




+ tr(CF-^) 


N'^a 




iV2 


< 


< 


L\\9o-94^ ^ L\\9o-94^ 


A^2 

2L\\9o-94^ 

N2 


iV2 


where we have used that the real bound in Proposition 2 is in fact in (N — l)a + (3, (see 
Lemma 6) and that tT{CH~^) < since v > N/L. 


F.3 Proof of Proposition 2 
F.3.1 Proof outline 

To prove Proposition 2 we will use Lemmas 6, 7 and 8, that are stated and proved in 
Section F.3.2. 

We want to bound $\mathbb{E}\big[\sum_{k=1}^n \|M_i D_i^{n-k}\varepsilon_k^i\|^2\big]$. According to Lemma 6, we have an explicit
expansion using the roots of the characteristic polynomial:

































Thus, by bounding $(k-1)\alpha + \beta$ by $(n-1)\alpha + \beta$, we get


k=l 




k=l 


{^i -rjf 


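Then, we have from Lemma 7 the inequality:

$$\sum_{k=1}^n \frac{\big[(r_i^-)^k - (r_i^+)^k\big]^2}{\big[(r_i^-) - (r_i^+)\big]^2} \leq \frac{2 - \beta h_i}{4\alpha\beta h_i^2\big(1 - (\tfrac{1}{4}\alpha + \tfrac{1}{2}\beta)h_i\big)}. \qquad (22)$$

Therefore

$$\sum_{k=1}^n \mathbb{E}\big\|M_i D_i^{n-k}\varepsilon_k^i\big\|^2 \leq \frac{\mathbb{E}[(\varepsilon^i)^2]}{h_i}\,\big((n-1)\alpha + \beta\big)^2\,\frac{2 - \beta h_i}{4\alpha\beta\big(1 - (\tfrac{1}{4}\alpha + \tfrac{1}{2}\beta)h_i\big)}.$$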


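This allows us to prove the first part of the bound. The other parts are much simpler and are
done in Lemma 8. Thus, adding these bounds gives for $\alpha \leq 1/h_i$ and $0 \leq \beta \leq 2/h_i - \alpha$:

$$\frac{1}{n^2}\sum_{k=1}^n \mathbb{E}\big\|M_i D_i^{n-k}\varepsilon_k^i\big\|^2 \leq \min\Big\{\frac{2(\alpha(n-1) + \beta)^2\,c_i}{\alpha\beta n^2(4 - (\alpha + 2\beta)h_i)h_i},\ \frac{16((n-1)\alpha + \beta)^2\,c_i}{n(\alpha+\beta)^2 h_i},\ \frac{(\alpha(n-1) + \beta)^2\,c_i}{n\alpha},\ \frac{8((n-1)\alpha + \beta)^2\,c_i}{\alpha + \beta}\Big\}.$$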


F.3.2 Some technical Lemmas 


We first compute an explicit expansion of the noise term as a function of the eigenvalues of 
the dynamical system. 

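Lemma 6. For all $\alpha \leq 1/h_i$ and $0 \leq \beta \leq 2/h_i - \alpha$ we have

$$\mathbb{E}\big\|M_i D_i^{n-k}\varepsilon_k^i\big\|^2 = h_i\big((k-1)\alpha + \beta\big)^2\,\mathbb{E}[(\varepsilon^i)^2]\,\frac{\big[(r_i^-)^{n-k} - (r_i^+)^{n-k}\big]^2}{(r_i^- - r_i^+)^2}.$$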



Proof. We first turn the Euclidean norm into a trace, using that $\mathrm{tr}[AB] = \mathrm{tr}[BA]$ for two
matrices $A$ and $B$ and that $\mathrm{tr}[x] = x$ for a real $x$.

= TiDr’^M^M^Dr’^EiHiilf], (23) 


This enables us to separate the noise term from the rest of the formula. Then we compute 
the latter from the definition of in Eq. (21) : 


EK(a)’"i 


{{k - l)o + /3)^ 

ir~ -rfy 


E[(e*)^] 




And the first part of Eq. (23) is equal to: 


MiDf-^ = hi 


_ N 2(n—/c) 


[fi) 


(n—k) 




{>■.+ ) 


2(n—k) 
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(t- 0\ ^l/2\ 

because A = f q ^+\ and M* = f 0 j 

EIIA^Dr-^aip = ;,, (('^ -_l)° + ^ i;|/"i|( -)-t _ (r+)»-‘]^ 

□ 


In the following lemma, we bound a certain sum of powers of the roots.

Lemma 7. For all $\alpha \leq 1/h_i$ and $0 \leq \beta \leq 2/h_i - \alpha$ we have

$$\sum_{k=1}^n \frac{\big[(r_i^-)^k - (r_i^+)^k\big]^2}{\big[(r_i^-) - (r_i^+)\big]^2} \leq \frac{2 - \beta h_i}{4\alpha\beta h_i^2\big(1 - (\tfrac{1}{4}\alpha + \tfrac{1}{2}\beta)h_i\big)}.$$

We first note that when the two roots become close, the denominator and the numerator
will go to zero, which prevents us from bounding the numerator easily. We also note that this
bound is very tight since the difference between the two terms goes to zero when $n$ goes to
infinity.
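The bound of Lemma 7 and its tightness can be illustrated numerically in both the real and the complex case. This is a minimal sketch: the function names and the two parameter settings are assumptions chosen for illustration, and complex arithmetic from the standard `cmath` module is used so that one formula covers both cases (the coalescing case $r_i^- = r_i^+$ is excluded since it would divide by zero).

```python
import cmath


def lemma7_sum(alpha, beta, h, n):
    # Partial sum over k = 1..n of ((r_-^k - r_+^k) / (r_- - r_+))^2, valid for distinct roots.
    r = 1.0 - (alpha + beta) * h / 2.0
    delta = ((alpha + beta) * h / 2.0) ** 2 - alpha * h
    root = cmath.sqrt(delta)
    r_minus, r_plus = r - root, r + root
    total = sum(((r_minus ** k - r_plus ** k) / (r_minus - r_plus)) ** 2
                for k in range(1, n + 1))
    return total.real


def lemma7_bound(alpha, beta, h):
    # Right-hand side (2 - beta h) / (4 alpha beta h^2 (1 - (alpha/4 + beta/2) h)).
    return (2.0 - beta * h) / (4.0 * alpha * beta * h ** 2 * (1.0 - (alpha / 4.0 + beta / 2.0) * h))


real_sum, real_bound = lemma7_sum(0.1, 0.9, 1.0, 200), lemma7_bound(0.1, 0.9, 1.0)
complex_sum, complex_bound = lemma7_sum(0.1, 0.1, 1.0, 400), lemma7_bound(0.1, 0.1, 1.0)
```

In both settings the partial sum stays below the bound and is already very close to it for a few hundred terms, consistent with the remark that the bound is tight as $n$ grows.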


Proof. We first expand the square of the difference of the powers of the roots and compute 
their sums. 




k=l 


k=l 


_2n 


1 1 -’’i ol - (^t^i ) 

, o ' ■" 


1 - r7 


1 — r, 


_2 


1 - (r+r. 


1 — r 


1 1 

+ 


+2 


_2 

t ^ ' i 

1 1 

+ — 


l-r" l-(A^) 


_2n 


_ 2n 


+ 


1 +2 1 -2 
I-rp 1 - r. 


(rp^fp~\n 

_ 2 \ i i y 

1 - (rt^P) 


1-rf 1-rf 




with = 


1-U 


+ 


1 — 


- 2 


l-(r+r. ) 


This sum is therefore equal to the sum of one term we will compute explicitly and one other 
term which will go to zero. We have for the first term: 


1 — r, 


+ 2 


+ 



1 - (r+r. 


(1 - rf)il - (r+r-)) - (1 - rf)il - rf ) 

(l-rf)(l-r-")(l-(r+r-)) 

^ (1 - rf){l - (r+r-)) - (1 - rf){l - rf) 

(l-rf)(l-r-")(l-(r+rr)) 


33 



with 


(1 - rf ){l - (r+rr)) - (1 - rf){l - rf) = (1 - rf)[{l - (r+r")) - (1 - rf)] 

= -r-), 

d 

(1 - rf){l - (r+r-)) - (1 - rf){l - rf) = -r-(l - rf){r+ - r"), 


fO--'f’f){rt-r~)-r~{l-rf){rf-r~) = {f - r~)[rf {I - rf) - r~ {I - rf)] 

= if - ) f - r~ + rf r," (r+ - )] 

= f -r~)‘^[l + rtr-]. 




Therefore the first term is equal to: 


1 1 
1 — rj 1 — r, 


_ 2 ^ f -rj )^[l + r-+r. ] 

-2 l-(r+r-) {^i-rf){l-rf){l-{rtr-)y 


and the sum can be expanded as: 

[(r-)* - ('•f )'=]" 

iZ—/ \j.~ — f~yx^ 

k=i ^ ^ ^ J 


_ ] _ 

(l-r+')(l-r-')(l-(r+r-)) 


with Jn = ,, +'12 • 

[Ti )-(^ )] 

Then we simplify the first term of this sum using the explicit values of the roots. We recall 
rf = ri± y/Ki = 1 - f^hi ± \l(ff hf - ahi, therefore 


r~^ r ■ = — A ? 

I ' Z ' I I 


= 1 - 


cx 13 


hi - 


oi (3 


hi + ahi 


= I - fihi 


(l-rf)il-rf) = [(l-r-)(l-r+)][(l + r+)(l + r+)] 

= [(1 -ri + fAi){{l -ri- ■\/A^)][(1 + r* + \/A^)((1 + r* - ff)] 
= [(1 - rf - Ai] [(1 + rf- Ai] [(1 - rf - Ai] 

= 4a/ii - Qa + /ijV 


f [(r. - {rf)f 

h [f)-f)V 


2-Phi 


[(^i ) - ff AaPhjil - {\a + \P)hi) 
Even if $J_n$ will be asymptotically small, we want a non-asymptotic bound, thus we will
show that $J_n$ is always positive.

In the real case [(r”) — > 0 and using + 6^ > 2ab, for all (a, b) G M^, we have 


r- 


j^2n 


_ 2n 


1 - r] 


+ 2 


+ 


1 — r, 


_2 — 


> 2 


y 




_ 1_2 _2 _ _ 

and using + r^ > 2r- we have 


-rf)(l-r-')<l-(r+r-). 


Since 


(1 - r+ )(1 - r- ) - [1 - (r+r-)]2 = 1 - r+ - r" + {rfr-f - 1 + 2r+r- - (rfr-f 

o + - +2 -2 

= 2r, -E 

< 0 . 


Thus 


_^2n 

_j_ 


_2n 


_ 2 ^ — > 0 . 


l-rf l-rf 
and Jn > 0 in the real case. 

In the complex case, [(r“) — (rJ)]^ < 0, and using < 2zz for all z G C, we have 

_2n 


j^2n 


1 +2 ' _2 - 
I-rj 1 - r. 


< 2 


(E 




and using r^ < 2r^ we have 


^{l-rfyi-rf) > 1 - (r+r-). 


Thus 


_j_2n 


1 — r. 


+ 2 


r- 

+ * 


_ 2n 


1 — r, 


_2 


_ 2_w_4J— < 0. 


1 - {rJr^ 


and Jn > 0 in the complexe case. 


Therefore we always have: 


Jn > 0, 


^ [{u)-{rt)? ~ ^al3hy{l-{\a+\l3)hi)' 


□ 


35 



However we can also bound roughly Eq. (22) using Theorem 2 since we recall we have 
rf^ = ^ —i-. This gives us the following lemma which enables to prove the 

w i ) 

second part of Proposition 2. 

Lemma 8 . For all a <1/hi and 0</?<2//ij — a we have 




1/2 j~.n—k^i ||2 


k=l 


. < El(e') ]n{(„ - l)a + min ^ ^ ^ 


2 8n 


16 


Proof. From Lemma 6 , we get 

n 

E iimPd: 




r('„-\n-fc _ ^„+'vn-fc]2 


fc=i 


A:=l 


^2^/2 -N , f 2 


< hjE[(e*) ]((n — l)a +/3)^nmin < 


(^. _^+)2 
8n 


16 


\ ahi ’ (a + /3)/ii ’ (a + 


< E[(e*) ]n((n — l)a +/3)"^ min < —, 


2 8n 


16 


a ’ a + /? ’ hj(a + /3)2 J 


□ 


G Comparison with additional other algorithms 


G.l Summary 


When the objective function $f$ is quadratic and for correct choices of step-sizes, the AC-SA
algorithm of Lan (2012), the SAGE algorithm of Hu et al. (2009) and the Accelerated RDA
algorithm of Xiao (2010) are all equivalent to:


n — 2 


^n+l — [I ^n+lHn+l\(^n ^n+lHji-\-l\{9n 9n—l') T j 


where we use $H_n\theta + \varepsilon_n$ as an unbiased estimate of the gradient and $\delta_n$ as step-size, whose
values will be specified later.


Lan (2012) and Hu et al. (2009) only consider bounded cases by projecting their iterates
on a bounded space. Xiao (2010) deals with the unbounded case and proves the following
convergence result:


Theorems. (Xiao, 2010, Theorem 6). With¥,[en® £n\ 
'■y <llL, we have 


E/(0n) - 


= C, for step-size 6n < 

nycr^ tr C 

3 ■ 


^^7 with 

n ' 


This result is significantly more general than ours since it is valid for composite optimization 
and general noise on the gradients. 

We now present the different algorithms and show they all share the same form. 
G.2 AC-SA


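Lemma 9. The AC-SA algorithm with step-sizes $\gamma_n$ and $\beta_n$ and gradient estimate $H_{n+1}\theta_n + \varepsilon_{n+1}$
is equivalent to:

$$\theta_{n+1} = \Big(I - \frac{\gamma_n}{\beta_n}H_{n+1}\Big)\theta_n + \frac{\beta_{n-1} - 1}{\beta_n}\Big(I - \frac{\gamma_n}{\beta_n}H_{n+1}\Big)(\theta_n - \theta_{n-1}) + \frac{\gamma_n}{\beta_n}\varepsilon_{n+1}.$$

Proof. We recall the general AC-SA algorithm:

• Let the initial points $x_1^{ag} = x_1$, and the step-sizes $\{\beta_n\}_{n \geq 1}$ and $\{\gamma_n\}_{n \geq 1}$ be given.
Set $n = 1$.

• Step 1. Set $x_n^{md} = \beta_n^{-1}x_n + (1 - \beta_n^{-1})x_n^{ag}$.

• Step 2. Call the Oracle for computing $G(x_n^{md}, \xi_n)$ where $\mathbb{E}[G(x_n^{md}, \xi_n)] = f'(x_n^{md})$.
Set

$$x_{n+1} = x_n - \gamma_n G(x_n^{md}, \xi_n), \qquad x_{n+1}^{ag} = \beta_n^{-1}x_{n+1} + (1 - \beta_n^{-1})x_n^{ag}.$$

• Step 3. Set $n \to n + 1$ and go to step 1.

When $f$ is quadratic we will have $G(x_n^{md}, \xi_n) = H_{n+1}x_n^{md} - \varepsilon_{n+1}$, thus $x_{n+1} = x_n - \gamma_n H_{n+1}x_n^{md} + \gamma_n\varepsilon_{n+1}$, and:

$$
\begin{aligned}
x_{n+1}^{ag} &= \beta_n^{-1}x_{n+1} + (1 - \beta_n^{-1})x_n^{ag} \\
&= \beta_n^{-1}\big(x_n - \gamma_n H_{n+1}x_n^{md} + \gamma_n\varepsilon_{n+1}\big) + (1 - \beta_n^{-1})x_n^{ag} \\
&= \beta_n^{-1}\big(\beta_n x_n^{md} + (1 - \beta_n)x_n^{ag} - \gamma_n H_{n+1}x_n^{md} + \gamma_n\varepsilon_{n+1}\big) + (1 - \beta_n^{-1})x_n^{ag} \\
&= x_n^{md} - \frac{\gamma_n}{\beta_n}H_{n+1}x_n^{md} + \frac{\gamma_n}{\beta_n}\varepsilon_{n+1}.
\end{aligned}
$$

Moreover,

$$
\begin{aligned}
x_n^{md} &= \beta_n^{-1}x_n + (1 - \beta_n^{-1})x_n^{ag} \\
&= \beta_n^{-1}\big(\beta_{n-1}x_n^{ag} + (1 - \beta_{n-1})x_{n-1}^{ag}\big) + (1 - \beta_n^{-1})x_n^{ag} \\
&= x_n^{ag} + \frac{\beta_{n-1} - 1}{\beta_n}\big[x_n^{ag} - x_{n-1}^{ag}\big].
\end{aligned}
$$

These give the result for $\theta_n = x_n^{ag}$.

□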
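The reduction of AC-SA to a two-step recursion on $\theta_n = x_n^{ag}$ can be checked on a one-dimensional quadratic. The sketch below is an illustration under stated assumptions, not Lan's implementation: the function names are hypothetical, the Hessian is a constant $h$, the oracle follows the convention of the proof ($G = h\,x^{md} - \varepsilon$), the "noise" is a fixed alternating sequence so that both runs see the same perturbations, and the step-sizes $\beta_n = (n+1)/2$, $\gamma_n = 0.3\,\beta_n$ are arbitrary rather than the optimal ones.

```python
def ac_sa(x1, betas, gammas, h, eps):
    # Three-sequence AC-SA on f(x) = h x^2 / 2; returns x_1^ag, x_2^ag, ...
    x, x_ag = x1, x1
    ags = [x_ag]
    for n in range(len(betas)):
        b, g = betas[n], gammas[n]
        x_md = x / b + (1 - 1 / b) * x_ag
        x = x - g * (h * x_md - eps[n])            # oracle G = h * x_md - eps
        x_ag = x / b + (1 - 1 / b) * x_ag
        ags.append(x_ag)
    return ags


def two_step_form(x1, betas, gammas, h, eps):
    # Same iterates written as a momentum recursion on theta_n = x_n^ag only.
    thetas = [x1]
    for j in range(len(betas)):
        step = gammas[j] / betas[j]
        if j == 0:
            query = thetas[0]                      # x_1^md = x_1 since x_1^ag = x_1
        else:
            momentum = (betas[j - 1] - 1) / betas[j]
            query = thetas[j] + momentum * (thetas[j] - thetas[j - 1])
        thetas.append((1 - step * h) * query + step * eps[j])
    return thetas


n_iter = 40
betas = [(j + 2) / 2 for j in range(n_iter)]       # beta_n = (n + 1) / 2, illustrative
gammas = [0.3 * b for b in betas]                  # illustrative, not the optimal choice
eps = [0.1 * (-1) ** j for j in range(n_iter)]     # deterministic stand-in for the noise
a = ac_sa(1.0, betas, gammas, 1.0, eps)
b = two_step_form(1.0, betas, gammas, 1.0, eps)
max_gap = max(abs(u - v) for u, v in zip(a, b))
```

The two sequences coincide up to rounding, which matches the statement of the lemma for $\theta_n = x_n^{ag}$.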


G.3 SAGE 

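Lemma 10. The algorithm SAGE with step-sizes $L_n$ and $\alpha_n$ is equivalent to:

$$\theta_{n+1} = \Big(I - \frac{1}{L_{n+1}}H_{n+1}\Big)\theta_n + (1 - \alpha_n)\frac{\alpha_{n+1}}{\alpha_n}\Big[I - \frac{1}{L_{n+1}}H_{n+1}\Big](\theta_n - \theta_{n-1}) + \frac{1}{L_{n+1}}\varepsilon_{n+1}.$$

Proof. We recall the general SAGE algorithm:

• Let the initial points $x_0 = z_0 = 0$, and the step-sizes $\{\alpha_n\}_{n \geq 1}$ and $\{L_n\}_{n \geq 1}$ be given.
Set $n = 1$.

• Step 1. Set $x_n = (1 - \alpha_n)y_{n-1} + \alpha_n z_{n-1}$.

• Step 2. Call the Oracle for computing $G(x_n, \xi_n)$ where $\mathbb{E}[G(x_n, \xi_n)] = f'(x_n)$. Set

$$y_n = x_n - \frac{1}{L_n}G(x_n, \xi_n), \qquad z_n = z_{n-1} - \alpha_n^{-1}(x_n - y_n).$$

• Step 3. Set $n \to n + 1$ and go to step 1.

We have

$$y_n = \Big(I - \frac{1}{L_n}H_n\Big)x_n + \frac{1}{L_n}\varepsilon_n,$$

and

$$
\begin{aligned}
z_n &= z_{n-1} - \alpha_n^{-1}(x_n - y_n) \\
&= z_{n-1} - \alpha_n^{-1}\big[(1 - \alpha_n)y_{n-1} + \alpha_n z_{n-1} - y_n\big] \\
&= \alpha_n^{-1}y_n - \alpha_n^{-1}(1 - \alpha_n)y_{n-1}.
\end{aligned}
$$

Thus

$$
\begin{aligned}
x_n &= (1 - \alpha_n)y_{n-1} + \alpha_n z_{n-1} \\
&= (1 - \alpha_n)y_{n-1} + \alpha_n\big[\alpha_{n-1}^{-1}y_{n-1} - \alpha_{n-1}^{-1}(1 - \alpha_{n-1})y_{n-2}\big] \\
&= y_{n-1} + \frac{\alpha_n(1 - \alpha_{n-1})}{\alpha_{n-1}}\big[y_{n-1} - y_{n-2}\big].
\end{aligned}
$$

These give the result for $\theta_n = y_n$.

□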


G.4 Accelerated RDA method 

Lemma 11. The algorithm AccRDA with step-sizes (3 and an is equivalent to: 

^n+1 — (yl 'ln-\-\HYi-\-\)9yi “h (1 Oifi) [I ^Ti+l^n+l] (^n 1) H“ 'T'M+I^R+1 ? 

CXfi 

with-in = 

Proof. We recall the general Accelerated RDA method: 

• Let the initial points wq = uq, Aq = 0, = 0 and the step-sizes {an}n<i and {/3n}n<i 

be given. 

Set n = 1 
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• Step 1. Set An = A^-i + cxn and 9n = 

-^n 

• Step 2. Compute the query point Un = {1 — 9n)wn-i + 9nVn-i 

• Step 3. Call the Oracle for computing = G{un,^n) where E[G(tt,i, .^„)] = f'{un), 
and update the weighted average gn 

9n — (1 9fi')gn—l “1“ 9ngn- 


• Step 4. Set Vn = vq - 

• step 5. Set Wn = (1 - 9n)Wn-l + 9nVn- 

• Step 6. Set n —)• n + 1 and go to step 1. 


First we have 

Vn = Vo 


= Vo 


L + (3n 

An 


9n 


[(1 ^n)9n—l ^n9n\ 


L + Pn 

^0 “F i 9n)gn—l ^n(^n+l^n ^n+l)] 


L + I3n 

= Uq + (1 — 9n) 


An{L + Pn-l) ^ 
{L + /3n)An-l 


Vn-1 - 


L + Pn 

CXfi 




= Uo + (1 - ^^ Vn-1 - (Hn+lUn + En+l)]- 

+ Pnj^n-1 ^ + Pn 

With (3n = 13 we have Vn = Vn-i - -^{Hn+iUn + Sn+i)] and 

Wn- (I- j^Hn+l)Un + 


Since Vn-i = - 9n-i)wn-2, then 

Vn — (1 9n)'(Vn—l “t 1 ^n—l)^n— 2 )) 


and 

0<nAn—2 r 1 

Un = Wn-1 H- ^[Wn-1 “ Wn-2\- 

C^n—l^n 

□ 


H Lower bound for stochastic optimization for least-squares 

In this section, we show a lower bound for optimization of quadratic functions with noisy
access to gradients. We follow very closely the framework of Agarwal et al. (2012) and use
their notations. The only difference with their Theorem 1 is the different choice of the two
functions $f_i^+$ and $f_i^-$, which we choose to be:

$$f_i^\pm(x) = c_i\Big(x_i \pm \frac{r}{2}\Big)^2,$$

with a non-increasing sequence $(c_i)$ to be chosen later. The function that is optimized
is thus:

$$g_\alpha(x) = \frac{1}{d}\sum_{i=1}^d\Big\{\Big(\frac{1}{2} + \alpha_i\delta\Big)f_i^+(x) + \Big(\frac{1}{2} - \alpha_i\delta\Big)f_i^-(x)\Big\}.$$

This function is quadratic and its Hessian has eigenvalues equal to $2c_i/d$. Thus, its largest
eigenvalue is $2c_1/d$, which we choose equal to $L$.

Noisy gradients are obtained by sampling $d$ independent Bernoulli random variables $b_i$,
$i = 1, \dots, d$, with parameters $(\frac{1}{2} + \alpha_i\delta)$ and using the gradient of the random function
$\frac{1}{d}\sum_{i=1}^d\{b_i f_i^+(x) + (1 - b_i)f_i^-(x)\}$. The variance of the random gradient is equal to


'^ = E 


^vai(bi[ci{xi + r/2) 


Ci{xi - r 


1 


i=l 



The function is minimized for x = —adr, and the discrepancy measure between two 
functions g^ and gi^ is greater than 


dk 


inf{/+(x)+/. (x)}-inf/+(x)-inf/. (x)|l„.^^ 

X ' X X 


> - 




3c,r^(5^ 


d^ 


1 ‘iCdV^d'^ 
^ d —4— 


Since the vectors a,/3 G { — 1,1}'^ are so that their Hamming distance A{a, (3) ^ d/A for 
a ^ 13, we have a discrepancy measure greater than Thus, for a an approxi- 

2 f 2 

mate optimality of e = , we have, following the proof of Theorem 1 (equation (29)) 

from Agarwal et al. (2012), for N iterations of any method that accesses a random gradient, 
we have: 


1/3 ^ 1 - 2 


IQNdd'^ + log 2 
dlog(2/Ve) 


2 

Thus, for d large, we get, up to constants, d‘^ ^ 1/N and thus e ^ 

For Cl = 2Ld and c, = Ly/d for the remaining ones, we get (up to constants): 


e ^ 


vVd 

lIv ■ 


This leads to the desired result for N ^ d. 